Abstract

Identity  management  entails  monitoring  people  in  a  shopping  mall,  airport,  or  other  large  area  with 
a set of video cameras, and determining automatically who is where at all times. We developed a mathematical
formulation and a set of algorithms that make significant strides towards a practical system for
identity  management.  At  the  core  of  the  formulation  lies  a  single  binary  integer  program  that  describes 
the  key  data  association  problem:  Nodes  in  a  graph  correspond  to  observations,  and  edges  are  weighted 
with  correlation  measures  that  quantify  positive  or  negative  evidence  for  the  hypothesis  that  two  nodes 
correspond  to  observations  of  the  same  person.  The  binary  integer  program  defines  a  partition  of  the 
nodes into sets that are meant to correspond to distinct identities. Solving this problem is NP-hard, and
we developed a problem decomposition method that, while losing optimality guarantees, shows good empirical
performance at near frame rate. To evaluate our method and establish a baseline for future work
by  us  and  others,  we  developed  a  large  video  data  set  with  more  than  1  million  frames  and  more  than 
2000  identities  observed  from  eight  cameras  placed  on  the  campus  of  Duke  University.  The  data  set  is 
fully  annotated,  and  a  3D  trajectory  is  available  for  each  person  in  every  frame  from  every  camera.  We 
also  formulated  a  new  methodology  for  performance  evaluation  in  identity  management. 

To  achieve  these  results,  one  principal  investigator  worked  with  9  students  to  publish  17  articles 
in  top  venues  in  computer  vision  and  image  processing.  The  sum  of  this  work  constitutes  a  significant 
contribution  to  the  field.  In  addition,  three  of  the  students  graduated  with  a  PhD  based  on  their  work  under 
the project. Two more earned a Master's degree, an additional two are still working towards their PhDs,
and  two  undergraduates  received  their  first  exposure  to  advanced  computer  science  research  through 
their  work  for  this  project.  New  collaborations  were  established  with  research  organizations  in  Mexico 
and  Italy. 


Foreword 


The  core  question  addressed  by  this  project  is  how  to  monitor  a  large  area  such  as  a  shopping  mall,  airport, 
or  university  campus  with  a  set  of  video  cameras  and  determine  automatically  who  is  where  at  all  times. 
This  problem  is  sometimes  called  identity  management.  A  system  to  accomplish  this  task  is  of  obvious 
usefulness  in  security,  surveillance,  crowd  control  and  monitoring,  and  related  applications. 

Identity  management  is  difficult  for  a  variety  of  reasons:  Different  people  may  dress  or  look  alike,  and, 
conversely,  the  same  person  may  look  different  under  different  lighting  or  from  different  viewpoints.  The 
cameras  may  not  cover  the  area  of  interest  seamlessly,  and  people  may  disappear  from  one  field  of  view  and 
reappear  in  another  one  only  after  a  long  interval  of  time,  or  not  at  all.  Even  more  fundamentally:  How  to 
distinguish  a  person  from  something  else?  How  to  track  a  person  reliably  in  the  face  of  uneven  and  complex 
motion,  body  articulation,  occlusions,  changes  of  lighting  and  viewpoint? 

Conceptually,  identity  management  turned  out  to  translate  to  interesting  mathematical  problems  at  all 
levels  of  the  system.  At  the  core  level  of  data  association — which  observation  relates  to  what  identity — we 
formulated  a  Binary  Integer  Program  that  captures  simply  and  precisely  the  overall  structure  of  the  problem. 
At  the  level  of  visual  tracking,  we  proposed  a  new  Lagrangian  model  of  image  motion  that  describes  the 
trajectories  of  every  pixel  in  every  frame  of  a  long  video  sequence.  At  the  lowest  levels  of  image  processing 
and  motion  analysis,  we  devised  new  lighting  models,  person  detection  algorithms  that  combine  information 
from  color  and  depth  cameras,  new  ways  to  describe  image  shape,  pictorial  structures  that  accelerate  person 
detection,  and  much  more.  We  marked  our  exciting  journey  towards  the  development  of  a  state-of-the-art 
theory of identity management with 17 publications in the top computer vision venues (Table 1).

For  performance  evaluation,  we  developed  the  largest  fully-annotated  data  set  to  date,  with  more  than  1 
million  frames  of  high-quality  video  and  more  than  2000  identities.  Dozens  of  students  and  friends  spent 
many  hours  annotating  every  person  in  every  frame.  We  are  about  to  make  this  data  set  available  to  the 
research  community,  and  this  contribution  alone  is  likely  to  accelerate  progress  in  visual  tracking  and  identity 
management by allowing competing approaches to be compared in a fair and thorough manner. We are also
proud  of  a  mathematically  simple,  new  methodology  that  we  have  developed  to  evaluate  the  performance  of 
identity  management  systems. 

This  research  was  a  playing  field  for  the  intellectual  growth  of  nine  students:  Zhiqiang  Gu,  Ying  Zheng, 
and Susanna Ricco graduated with PhD theses directly tied to the project, and are now successfully employed
at Google Research and Apple. Ergys Ristani and Cassandra Carley are working towards their PhDs,
and have published several articles while funded by this grant. Alan Davidson and Branka Lakic have earned
Master's degrees in Computer Science with work that advanced various aspects of this project. And
undergraduates Trevor Terris and Sterling Dorminey received their first exposure to advanced computer science
research  thanks  to  their  work  with  us.  Graduate  students  helped  mentor  undergraduates,  thereby  learning  a 
skill  that  will  serve  them  throughout  their  professional  lives. 

To address some of the challenges of our project we reached out to groups in other countries,
particularly CICATA, the Centro de Investigación en Ciencia Aplicada y Tecnología Avanzada in
Querétaro, Mexico, and the ImageLab at the University of Modena and Reggio Emilia in Italy. These
collaborations  continue,  and  open  our  minds  and  those  of  our  students  to  different  ways  of  thinking  and 
working.  They  would  not  have  started  without  ARO  funding  for  this  project. 

We  are  therefore  very  grateful  to  the  Army  Research  Office  for  their  support  of  our  work,  and  for  the 
insight  and  foresight  of  its  officers  who  saw  in  a  somewhat  speculative  proposal  more  than  five  years  ago  the 
potential  for  transformative  research  in  a  field  that  has  in  the  meantime  blossomed — perhaps  even  in  small 
part  thanks  to  our  work — into  a  thriving  subfield  of  computer  vision  across  the  world. 
1  (a) Velocity estimation of the blue detection for m = 3. Circles are detections, the horizontal dimension is time, and the vertical one stands for 2D space. Green detections are the nearest detections in space to the blue detection for each k. Detections in grey are not considered for velocity estimation. Detection p_i is discarded because the speed required to reach the blue detection from it exceeds a predefined limit. The green vectors are the velocities computed for each blue-green detection pair and the blue vector is the estimated velocity. (b) Circles enclose disjoint space-time groups, found from assumed bounds on walking speed.

2  The proposed single-camera processing pipeline from detections to trajectories. One more inter-camera stage aggregates trajectories into identities.

3  MOTA scores (a, c) and running times (b, d) as functions of the length of the sliding window (a, b) and the number of appearance groups (c, d) for the Town Center sequence. Solver time indicates how much time was spent for assembling and solving all the Binary Integer Programs. The total running time also includes the time for computing correlations, but does not account for person detection. Figures (a) and (b) are for ten appearance groups, and Figures (c) and (d) are for 8-second sliding windows. Best viewed on screen.

4  Retrieval results on a subset of our data set (Sec. 2.4.2) based on appearance only. Existing methods fail to exploit the availability of more frames to increase the descriptor robustness.

5  A spatiotemporal cube of the marple7 sequence. Time runs from left to right. The corner of the crate (cyan) is first occluded by Miss Marple’s arm (green) in frame 12. A small patch (red dashed squares) around each path in every frame is transported along the current path estimates and monitored for consistent appearance. The arm patch (top right) is most consistent, and makes this the controlling path at that point and frame. Points along paths that either coincide with or are substantially parallel to a nearby controlling path have their observed visibility flag $v_p(t)$ set to 1. All other flags are set to 0. Observed flags affect the estimated visibility flags at the nodes of an MRF that enforces spatial and temporal consistency of the flags and ensures that at least one path is visible at every pixel.

6  Anchor point selected during initialization (a) and at convergence (b). Colors other than gray denote anchor points, and similar colors denote similar sets of path coefficients. Note the improved segmentation of Miss Marple after convergence.

7  Effect of path regrouping. Motion estimates are shown using the same color scheme as in Figure 6. The first image in each pair shows the solution after the first round of optimization; the second shows results at convergence. Regrouping (b) recovers from a poor local optimum with incorrect estimates for the motion of the occluded background.
last frame warped to align with the first frame, and vice versa. Regions detected as occluded in the source frame of the warp are marked in white. Solution times (rounded to nearest half-hour) exclude basis computation.
1 Problem Statement


Several  human  behavior  analysis,  monitoring,  and  surveillance  scenarios  would  benefit  from  automatic 
methods that track multiple people through a network of cameras. The Identity Management (IM) problem,
to determine who is where at all times, is commonly cast as a data association problem: Partition all
observations of individuals found in every frame from every camera into a number of sets, one per identity.

Individual  observations  are  typically  the  outputs  from  a  person  detector,  and  often  come  in  the  form  of 
bounding  boxes.  Detectors  are  very  good  nowadays,  but  they  are  not  perfect,  and  may  miss  people  or  find 
people  where  there  aren’t  any.  In  addition,  multiple  bounding  boxes  often  cover  the  same  person  in  any 
given  frame.  So  the  initial  observations  come  with  both  false  positives  and  false  negatives. 

To  curb  computational  complexity  and  handle  unbounded  time  horizons,  the  IM  problem  is  typically 
solved  over  a  sliding  temporal  window  and  in  layers.  Specifically — and  consistently  with  the  terminology 
used  in  most  of  the  literature — a  detection  is  a  response  from  a  person  detector  from  one  camera;  a  tracklet 
aggregates  detections  in  consecutive  frames  into  short  sequences,  and  a  trajectory  is  a  sequence  of  tracklets. 
Each  detection,  tracklet,  or  trajectory  is  an  observation.  It  is  constructed  from  data  from  a  single  camera 
and  comes  with  timing  and  appearance  information  which  has  different  formats  for  different  observation 
types. Trajectories are strung into identities, that is, sequences of trajectories from one or more cameras. Each
identity  is  meant  to  correspond  to  a  single  person,  and  different  people  to  different  identities.  Tracklets, 
trajectories,  and  identities  are  aggregates. 

The  problem  can  then  be  defined  as  follows:  Observations — at  whatever  level — are  nodes  in  a  graph 
in which each edge has a correlation, that is, a measure of the positive or negative evidence that two
observations  pertain  to  the  same  person.  Identity  management  is  the  problem  of  partitioning  the 
nodes  of  the  graph  into  sets,  one  set  per  identity,  so  that  correlations  within  each  set  are  maximized. 

This  formulation  will  be  made  more  precise  in  the  next  Section. 

What  changes  between  layers  is  what  evidence  is  encoded  in  the  correlations:  For  short-term  matching, 
a person’s appearance may be captured with image-domain descriptors such as histograms of colors or
oriented gradients. In contrast, capturing a person’s appearance along a trajectory presents both the challenge
of formulating a view-invariant signature and the opportunity for a more nuanced descriptor based on
multiple views of the same person. We developed a new, rich way to capture long-term appearance. Also, time
and  motion  may  play  a  strong  role  in  low-level  associations,  because  people  move  more  or  less  predictably 
in  the  short  term.  For  longer-term  associations,  on  the  other  hand,  time  and  motion  are  only  loosely  relevant, 
since  a  person  may  occasionally  change  direction  or  speed,  or  stop  while  out  of  sight. 

As IM methods solve larger and larger problems, it becomes increasingly important to define
performance measures that can handle any and all levels of aggregation consistently. While several such measures
have appeared in the literature, they tend to be reliable only either across cameras or within cameras, but
rarely for the system as a whole. In particular, current measures fail to satisfactorily address the
truth-to-result matching problem: A given ground truth trajectory may be claimed by different computed identities
that  span  different,  and  sometimes  even  overlapping,  time  intervals.  Which  computed  identity  gets  how 
much  credit  for  that  ground-truth  trajectory?  We  developed  a  simple  and  general  answer  to  this  question, 
and  derived  precision  and  recall  measures  that  match  the  IM  scenario  better  than  existing  measures. 

We  evaluated  our  system  on  a  standard  benchmark,  and  ran  separate  experiments  to  showcase  the  specific 
advantages  of  our  new  similarity  measure.  In  addition,  we  experimented  with  a  new  data  set  we  built  that 
has  more  than  1  million  frames,  fully  annotated  trajectories,  and  more  than  2000  identities.  It  consists  of 
8 × 85 minutes of 1080p video recorded at 60 frames per second from 8 cameras deployed on the Duke
University  campus  during  periods  between  lectures,  when  pedestrian  traffic  is  heavy. 
2 Summary of Results

We  made  the  following  contributions  to  the  core  problem  of  Identity  Management  (IM): 

•  A  unified  formulation  that  applies  both  within  and  between  cameras,  handling  both  overlapping  and 
disjoint  views  seamlessly. 

•  A new 3D descriptor of trajectory appearance that captures informative appearance details with minimal
blurring.

•  New,  consistent,  and  simple  measures  of  performance. 

•  A  large,  fully-annotated  data  set. 

•  Experiments  and  comparisons. 

We describe these contributions in a recent article [65] and in a paper in preparation, and we summarize
them  in  this  report. 

The  contributions  above  address  the  core  problems  of  IM.  To  make  a  practical  system  possible,  we 
also  had  to  revisit  the  standard  computer  vision  components  on  which  such  a  system  must  rely:  Methods 
for  detecting  people  in  individual  frames  and  tracking  them — or  something  else,  for  that  matter — between 
frames,  as  well  as  basic  mathematical  models,  data  structures,  and  algorithms  that  make  these  methods 
robust  and  efficient. 

These ancillary contributions are discussed in additional papers listed in Table 1. The most important one,
a  Lagrangian  formulation  of  image  motion,  is  summarized  in  the  Appendix.  We  do  not  use  this  formulation 
directly  in  our  work  on  identity  management  described  in  the  remainder  of  this  report,  mainly  because  of  its 
high  computational  cost.  Nonetheless,  the  development  of  this  method  has  led  to  many  improvements  in  the 
motion  analysis  aspects  of  our  main  line  of  work.  In  addition,  the  Lagrangian  formulation  can  lead  to  person 
detections  that  (i)  are  optimized  over  an  entire  video  sequence  rather  than  frame  by  frame,  and  (ii)  delineate 
the  image  region  occupied  by  a  person,  rather  than  a  rectangular  bounding  box  around  it.  Our  identity 
management  experiments  show  that  the  contamination  of  detection  descriptors  caused  by  background  pixels 
present  in  the  bounding  boxes  is  a  significant  source  of  identification  errors.  Because  of  this,  we  believe  that 
further  study  is  warranted  to  make  the  Lagrangian  formulation  computationally  more  efficient  and  therefore 
practically  useful,  and  we  stand  by  the  conceptual  and  theoretical  validity  of  this  formulation. 

The  following  sections  cover  prior  art  in  Identity  Management  (IM),  our  approach,  our  new  performance 
measure,  and  experimental  results. 

2.1  Related  Work 

Current research on multi-camera identity management relies on improved pedestrian detection lfl4l and
single-camera tracking, and most multi-camera IM methods even assume the availability of single-camera
trajectories at test time fl2Tl [27, 31, 32, 33, 37, 38, 43, 48, 54, 58, 80]. Exceptions [29, 81] solve the
single-camera and multi-camera problems in two steps through the same optimization framework.

Camera Placement Information. Spatial relations between cameras are explicitly mapped in 3D [3T>
m, learnt by tracking known identities j48j '4j2 28] or by comparing entry/exit rates across pairs of
cameras [27, 54, 58], or discovered on-line [43, 81]. Pre-processing methods may fuse data from partially
overlapping views [81], while work from completely overlapping and unobstructed views has been regarded as
a separate task in the literature Hl[l6l[2lll50ll441l.
Entry and exit points (EEPs) are sources and sinks of individuals, and may be explicitly estimated as
mixtures of Gaussians on the ground plane [27, 31, 54, 58]. Image-space approaches cluster pixels on image
boundaries [49] or split the image adaptively to localize EEPs [43]. Online formulations [31] can handle
time-varying EEPs.

Travel time information is modeled with Gaussians [49, 80] or in non-parametric ways OT1 [43, 48, 54] [35).

Appearance Descriptors. Color is summarized with RGB/HSV/YCbCr histograms [27, 31, 32, 33, 38,
43, 48, 49, 54, 80, 81] on foreground masks [33, 37, 38, 43, 49]. Texture is modeled through covariance
matrices [54], local binary patterns 021, or (P)HOG features [27, 37, 54, 80, 81]. To handle lighting
differences between cameras, methods either employ color normalization [27] or exemplar-based approaches [32],
or learn brightness transfer functions [E8 » [48j, even without labeled data [31, 43, 80, 81].

Discriminative power is improved by incorporating saliency information [59, 82] or by attaching color
and texture features to different body parts inside the bounding box [27, 32, 33, 37, 38, 49, 54], either in
the image plane task [[HI [TT1 [35j| or back-projected onto an articulated Q [34] or non-articulated 3D body
models ®.

Multi-view descriptors are sometimes obtained by averaging descriptors over several frames [32] or
by generating random transformations for improved comparisons between descriptors 071. Learning is
sometimes employed to weigh features differently for distinct pairs of cameras [32, 54] or to discover
target-specific features [27].

Multi-Camera  IM  Formulations.  Spatial,  temporal,  and  appearance  information  is  summarized  into 
weights $w_{ij}$ that somehow express the affinity of observation $i$ with observation $j$. One must eventually
decide  whether  these  observations  pertain  to  the  same  identity  or  not. 

For  trajectories,  bipartite  matching  can  make  this  determination  for  each  pair  of  cameras,  but  consistency 
of results across camera pairs is not guaranteed [38]. It can be enforced through global methods in which
trajectories are nodes in a graph and the $w_{ij}$ are edge weights. Path methods compute paths in the graph
by  maximizing  the  sum  of  weights  between  consecutive  path  nodes,  while  clique  methods  maximize  the 
sum  of  weights  on  all  edges  in  a  clique.  Path  and  clique  methods  apply  to  tracklets  within  cameras,  or  to 
trajectories  across  cameras.  Some  notable  contributions  are  as  follows: 
In our work, we formulate within-camera and across-camera tracking in a single unified framework, similarly to
previous IM flow methods [29, 81]. In contrast with these, our formulation is a clique method and can also
handle identities that reappear in the same field of view. Similarly to [38], we consider evidence across the
whole network.

Moreover,  we  describe  trajectory  appearance  with  a  new  signature  that,  differently  from  all  related  work, 
does  not  average  descriptors  appearing  at  the  same  location  but  rather  tracks  and  collects  patches  inside  the 
detection  bounding  box  over  time.  We  map  signatures  to  a  3D  body  model  only  when  two  signatures 
are compared to each other, and without averaging. This prevents appearance details from being blurred away,
exploits  information  from  the  entire  trajectory,  and  accounts  for  viewpoint  and  pose  variation  in  a  flexible 
and  nuanced  way. 
2.2 Mathematical Problem Formulation


As  mentioned  earlier,  a  detection  is  a  response  from  a  person  detector  from  one  camera;  a  tracklet  aggregates 
detections  in  consecutive  frames  into  short  sequences,  and  a  trajectory  is  a  sequence  of  tracklets.  Each 
detection, tracklet, or trajectory is an observation. Trajectories are strung into identities, that is, sequences of
trajectories  from  one  or  more  cameras.  Each  identity  is  meant  to  correspond  to  a  single  person,  and  different 
people  to  different  identities.  Tracklets,  trajectories,  and  identities  are  aggregates. 

In  the  following,  we  first  describe  our  mathematical  aggregation  framework,  which  is  common  to  all 
types  of  observations,  and  relies  on  a  graph  in  which  observations  are  nodes  and  measures  of  correlation 
between  observations  are  edges.  Detections,  tracklets,  and  trajectories  are  different  types  of  observations 
because  of  the  differences  in  the  amounts  of  time  and  variety  of  viewpoint  that  they  entail.  Because  of  this, 
node  descriptors  and  edge  weights  are  different  for  different  types  of  observations,  and  are  described  in 
subsequent  sections. 

In  a  typical  security  or  surveillance  scenario,  observations  constitute  a  flow  of  unbounded  duration. 
Section 2.2.6 shows how we stitch multiple instances of binary integer programs into a processing cascade
that  can  handle  an  unbounded  flow  of  observations. 


2.2.1  Aggregation 

In  contrast  with  the  literature,  we  formulate  each  of  the  three  aggregation  steps  above  as  a  Binary  Integer 
Program  (BIP)  of  exactly  the  same  format. 

Specifically, given edges $(i, j) \in E$ with signed weights $w_{ij}$ between observations $i, j \in V$, we set a
binary variable $x_{ij}$ to 1 if the observations are to be in the same identity and to 0 otherwise. Weights can be
defined between any two observations of the same type, consecutive or not. The graph $G = (V, E)$ need not
be complete. For each aggregation problem we then find the

$$\arg\max \sum_{(i,j) \in E} w_{ij}\, x_{ij} \tag{1}$$

subject to the following constraint on all triangles of $G$

$$x_{ij} + x_{jk} \le 1 + x_{ik} \tag{2}$$

to enforce transitivity: If $i$ and $j$ are in the same identity and so are $j$ and $k$, then $i$ and $k$ must be in the same
identity, too.
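
To make the objective and the transitivity constraint concrete, the following is a minimal Python sketch that solves the program by exhaustive enumeration on a toy graph. It is only an illustration under stated assumptions: the function name `solve_bip_brute_force`, the edge representation as `frozenset` pairs, and the example weights are hypothetical, and exhaustive search is not the solution method used in this report (it is exponential in the number of edges).

```python
from itertools import combinations, product


def solve_bip_brute_force(nodes, weights):
    """Exhaustively maximize sum(w_ij * x_ij) subject to transitivity.

    nodes   -- list of hashable observation ids
    weights -- dict mapping frozenset({i, j}) to a signed correlation w_ij;
               pairs missing from the dict are not edges of G
    Returns (best_value, best_x), where best_x maps each edge to 0 or 1.
    Illustrative only: the cost grows as 2 ** len(weights).
    """
    edges = list(weights)
    # Triangles of G: node triples whose three pairs are all edges.
    triangles = [
        (i, j, k)
        for i, j, k in combinations(nodes, 3)
        if all(frozenset(p) in weights for p in ((i, j), (j, k), (i, k)))
    ]
    best_value, best_x = float("-inf"), None
    for bits in product((0, 1), repeat=len(edges)):
        x = dict(zip(edges, bits))
        feasible = True
        for i, j, k in triangles:
            a = x[frozenset((i, j))]
            b = x[frozenset((j, k))]
            c = x[frozenset((i, k))]
            # The transitivity constraint, checked for each of the three
            # ways of choosing the "implied" edge of the triangle.
            if a + b > 1 + c or a + c > 1 + b or b + c > 1 + a:
                feasible = False
                break
        if not feasible:
            continue
        value = sum(weights[e] * x[e] for e in edges)
        if value > best_value:
            best_value, best_x = value, x
    return best_value, best_x
```

On a complete four-node graph with positive correlations on the pairs (0,1), (1,2), and (2,3) and negative correlations elsewhere, the sketch groups nodes into {0,1} and {2,3} whenever merging all four would pull in more negative than positive evidence, which is the behavior the transitivity constraint is meant to enforce.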

Remarkably,  the  two  expressions  above  capture  aggregation  precisely,  and  at  all  levels  of  the  processing 
pipeline. Finding an optimal solution to this correlation clustering problem is NP-hard [9], and the problem
is also hard to approximate [73]. The best known approximation algorithm achieves an approximation ratio
of 0.7664 [72], but its semi-definite program formulation makes it too slow for practical use. These
results suggest that one needs to look at the special properties of the multi-person tracking problem to find
an efficient solution, as we do in Section 2.2.6.


2.2.2  Detection  Descriptors  and  Correlations 

Each person detection $D = (\phi, p, t, v)$ is described by its appearance feature $\phi$, position $p$, time stamp $t$,
and estimated velocity¹ $v$. We use an HSV color histogram to describe a person’s appearance, but different
descriptors can be used with no other modification of the proposed methods.

¹ Velocity is a vector, and its norm is called the speed.
Figure 1: (a) Velocity estimation of the blue detection for m = 3. Circles are detections, the horizontal
dimension is time, and the vertical one stands for 2D space. Green detections are the nearest detections in
space to the blue detection for each k. Detections in grey are not considered for velocity estimation.
Detection p_i is discarded because the speed required to reach the blue detection from it exceeds a predefined
limit. The green vectors are the velocities computed for each blue-green detection pair and the blue vector
is the estimated velocity. (b) Circles enclose disjoint space-time groups, found from assumed bounds on
walking speed.


Co-identity  evidence  from  space  and  time  information  comes  mainly  from  the  assumption  that  people 
are  limited  in  their  speed,  and  reasoning  about  person  speed  requires  converting  image  coordinates  to  world 
coordinates.  To  this  end,  we  assume  that  people  move  on  a  planar  region  and  that  a  homography  is  available 
between  the  world  and  the  image. 

The velocity of a detection at position $p$ in video frame $i$ is estimated as follows. For each frame $k$ in
$[i - m, i + m]$ (where $m$ is a small integer) with $k \neq i$, determine the detection $p_k$ that is nearest (in space)
to $p$. Compute the velocities from each pair $(p, p_k)$, and discard those that violate a predefined speed limit.
The velocity estimate for the detection at $p$ is then the component-wise median of the remaining velocities.
See Figure 1(a).
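
The velocity estimation procedure just described can be sketched in a few lines of Python. This is a hedged illustration, not the code used in the experiments: the function name `estimate_velocity`, the dictionary layout of `detections_by_frame`, and the default values of `speed_limit` and `frame_period` are assumptions introduced here. Positions are assumed to be ground-plane coordinates already obtained through the homography.

```python
from statistics import median


def estimate_velocity(p, i, detections_by_frame, m=3,
                      speed_limit=3.0, frame_period=1.0):
    """Estimate the velocity of the detection at position p in frame i.

    p                   -- (x, y) ground-plane position of the detection
    i                   -- frame index of the detection
    detections_by_frame -- dict: frame index -> list of (x, y) positions
    m                   -- half-width of the temporal neighborhood
    speed_limit         -- assumed maximum plausible speed (hypothetical value)
    frame_period        -- assumed time between frames (hypothetical value)
    Returns an (vx, vy) tuple, or None if no candidate survives.
    """
    velocities = []
    for k in range(i - m, i + m + 1):
        candidates = detections_by_frame.get(k)
        if k == i or not candidates:
            continue
        # Nearest detection in space to p within frame k.
        pk = min(candidates,
                 key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
        dt = (k - i) * frame_period
        v = ((pk[0] - p[0]) / dt, (pk[1] - p[1]) / dt)
        # Discard pairs that would require an implausible speed.
        if (v[0] ** 2 + v[1] ** 2) ** 0.5 <= speed_limit:
            velocities.append(v)
    if not velocities:
        return None
    # Component-wise median of the surviving velocities.
    return (median(v[0] for v in velocities),
            median(v[1] for v in velocities))
```

The median makes the estimate tolerant of a minority of wrong nearest-neighbor matches, while the speed check removes matches that are physically implausible before the median is taken.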

Given two detections $D_1 = (\phi_1, p_1, t_1, v_1)$ and $D_2 = (\phi_2, p_2, t_2, v_2)$, we first define two simple
space-time and appearance affinity measures for them in $[0, 1]$, and then combine the affinities into a single
correlation measure.

Specifically, the space-time affinity of $D_1$ and $D_2$ is:

$$s_{st} = \max[1 - \beta\,(e(D_1, D_2) + e(D_2, D_1)),\ 0] \tag{3}$$

where $e(D_1, D_2) = \|q_1 - p_2\|_2$ measures the error between the position $p_2$ of detection $D_2$ and the
estimated position $q_1 = p_1 + v_1 (t_2 - t_1)$ of detection $D_1$ at time $t_2$. The parameter $\beta$ controls how much error
we are willing to tolerate. Setting a lower value for $\beta$ is helpful for handling long occlusions. We use $\beta = 1$.
The appearance affinity between $D_1$ and $D_2$ is:

$$s_a = \max[1 - \alpha\, d(\phi_1, \phi_2),\ 0] \tag{4}$$

where $d(\cdot)$ is a distance function in appearance space. We use the earth mover’s distance [67] in our
experiments to compare HSV histograms, and set $\alpha = 1$.
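
The two affinities can be sketched in Python as follows. The sketch is illustrative and rests on assumptions that are not part of the report: detections are represented as `(position, time, velocity)` tuples, and the appearance distance is a simplified stand-in, namely a normalized one-dimensional earth mover's distance between histograms (`emd_1d`, a hypothetical helper), rather than the earth mover's distance on full HSV histograms used in the experiments.

```python
import math


def position_error(d1, d2):
    """e(D1, D2): distance between D2's position and D1 extrapolated to t2.

    Each detection is a (position, time, velocity) tuple with 2D
    position and velocity; this layout is an assumption of the sketch.
    """
    p1, t1, v1 = d1
    p2, t2, _ = d2
    dt = t2 - t1
    q1 = (p1[0] + v1[0] * dt, p1[1] + v1[1] * dt)
    return math.hypot(q1[0] - p2[0], q1[1] - p2[1])


def space_time_affinity(d1, d2, beta=1.0):
    """Space-time affinity in [0, 1], symmetric in its two arguments."""
    error = position_error(d1, d2) + position_error(d2, d1)
    return max(1.0 - beta * error, 0.0)


def emd_1d(h1, h2):
    """Stand-in distance: 1D earth mover's distance, scaled to [0, 1].

    For 1D histograms with unit bin spacing, the earth mover's distance
    equals the sum of absolute differences between cumulative sums.
    Histograms must have at least two bins and positive total mass.
    """
    s1, s2 = float(sum(h1)), float(sum(h2))
    running, total = 0.0, 0.0
    for a, b in zip(h1, h2):
        running += a / s1 - b / s2
        total += abs(running)
    return total / (len(h1) - 1)


def appearance_affinity(phi1, phi2, alpha=1.0, distance=emd_1d):
    """Appearance affinity in [0, 1] for any distance function."""
    return max(1.0 - alpha * distance(phi1, phi2), 0.0)
```

Passing the distance as a parameter mirrors the remark above that other appearance descriptors can be substituted without changing the rest of the method.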
A sigmoid function maps affinities to correlations smoothly, except in extreme cases:

$$w = \begin{cases} -\infty & \text{if } s_a s_{st} = 0 \\ +\infty & \text{if } s_a s_{st} = 1 \\ -1 + \dfrac{2}{1 + \exp(-\lambda (s_a s_{st} - \mu))} & \text{otherwise} \end{cases} \tag{5}$$

The parameter $\lambda$ determines the width of the transition band between negative and positive correlation, and
$\mu$ is the value that separates them. We use $\mu = 0.25$, assuming that $s_a = s_{st} = 0.5$ indicates indifference.
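
A small Python sketch of this mapping follows. It assumes the sigmoid is scaled to the open interval (−1, 1) so that it crosses zero exactly where the product of the affinities equals the separating value. The function name `correlation` and the default transition-width parameter `lam=6.0` are hypothetical; only the separating value 0.25 comes from the text.

```python
import math


def correlation(s_a, s_st, lam=6.0, mu=0.25):
    """Map appearance and space-time affinities to a signed correlation.

    s_a, s_st -- affinities in [0, 1]
    lam       -- transition-band parameter (hypothetical default)
    mu        -- product of affinities at which the correlation is zero
    """
    s = s_a * s_st
    if s == 0:
        return -math.inf   # hard evidence against co-identity
    if s == 1:
        return math.inf    # hard evidence for co-identity
    # Sigmoid rescaled from (0, 1) to (-1, 1), centered at mu.
    return -1.0 + 2.0 / (1.0 + math.exp(-lam * (s - mu)))
```

Infinite weights act as hard constraints in the program: a pair with zero affinity can never be placed in the same identity, and a pair with unit affinity can never be separated.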


2.2.3  Tracklet  Descriptors  and  Correlations 


Tracklets (short trajectories of detections) have somewhat more complex descriptors than individual detections, because they extend over time. Specifically, a tracklet descriptor $T = \{\phi, p^s, p^e, t^s, t^e, v\}$ contains an appearance feature $\phi$ that is equal to the median appearance of its detections. The descriptor also contains the start point $p^s$ and end point $p^e$ of the tracklet, its start time $t^s$ and end time $t^e$, and its velocity $v$. Since tracklets are short, we assume that their detections are on a straight line and we approximate the velocity of the tracklets as follows:

$$v = \frac{p^e - p^s}{t^e - t^s} \tag{6}$$

The definition of appearance affinity remains the same for tracklets, once appearance descriptors are modified as explained earlier. For space-time affinities, the position error $e(\cdot, \cdot)$ is redefined to measure the discrepancy between a tracklet's start point and the estimated start point as determined from the end point of the other tracklet: $e(T_1, T_2) = \|q_1^s - p_2^s\|_2$ where $q_1^s = p_1^e + v_1 (t_2^s - t_1^e)$.
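
The tracklet velocity and the redefined position error can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tracklets are represented as plain dictionaries with hypothetical keys `ps`, `pe`, `ts`, `te`, and `v`, and a tracklet is assumed to span a nonzero time interval.

```python
import math

def tracklet_velocity(p_start, p_end, t_start, t_end):
    """Straight-line velocity estimate: displacement over elapsed time."""
    dt = t_end - t_start  # assumed nonzero (tracklet has more than one frame)
    return tuple((e - s) / dt for s, e in zip(p_start, p_end))

def tracklet_error(first, second):
    """e(T1, T2): extrapolate T1's end point to T2's start time and compare
    it with T2's actual start point."""
    dt = second["ts"] - first["te"]
    predicted_start = [pe + v * dt for pe, v in zip(first["pe"], first["v"])]
    return math.dist(predicted_start, second["ps"])

# Example: a person walking at 1 m/s along x, re-detected two seconds later.
t1 = {"ps": (0.0, 0.0), "pe": (1.0, 0.0), "ts": 0.0, "te": 1.0}
t1["v"] = tracklet_velocity(t1["ps"], t1["pe"], t1["ts"], t1["te"])
t2 = {"ps": (3.0, 0.0), "pe": (4.0, 0.0), "ts": 3.0, "te": 4.0}
error = tracklet_error(t1, t2)  # zero for perfectly linear motion
```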


2.2.4  Trajectory  Descriptors  and  Correlations 

Large variations in viewpoint and pose make comparing trajectories harder than comparing individual detections
or even tracklets, for which histograms and HOG-like descriptors may be adequate. This problem has
been  addressed  recently  El  by  making  the  comparison  in  3D  in  the  context  of  the  so-called  re-identification 
problem.  Specifically,  color  histograms  and  HOG-like  descriptors  are  back-projected  from  the  images  onto  a 
simplified  body  model  called  a  sarcophagus  by  using  camera  calibration  information,  and  descriptors  from 
different snapshots are collected and eventually averaged together. The robustness of this method is thus limited by one's ability to project information on the correct vertex of the mesh. In practice, pose changes, bounding box placement errors, and camera calibration inaccuracies often cause the averaging of descriptors that represent different body parts, with consequent loss of detail in the feature representation. These details are crucial, since, say, a bag or the color of one's shoes may distinguish otherwise similarly dressed individuals.

To overcome this limitation, we do not average descriptors on the 3D model. Instead, to remove frame-wise noise, we first average image descriptor histograms over the short tracks computed with a Lucas-Kanade tracker, initialized over a grid of points on the foreground mask of the person [78]. When a track is lost, a new point is initialized at the same grid location.

These track averages are more accurate than 3D averages because they do not involve any back-projection, and the tracker vouches for correct correspondence. For each track we store means and variances of descriptor histograms, as well as back-projected body position. A single trajectory $T$ then comes with a set of short tracks $\{p\}$, each described by a mean descriptor histogram $\mu$, a vector $\sigma$ of bin variances, the number $n$ of frames in the track, and the set $Q = \{q_1, \ldots, q_n\}$ of estimated 3D body positions.

To compare two trajectories, we first define a measure of similarity $s(p^a, p^b)$ between two track histograms that accounts for the statistical significance of each bin. We then extend this measure to trajectories and convert (non-negative) similarities to (signed) correlations, so we can apply the BIP approach described earlier.


Track Similarity. The means $\mu$ and variances $\sigma$ can be viewed as Gaussian approximations of a distribution over track descriptor histograms $h$ with independent bins. To compare two histograms $h^a$ and $h^b$ we ask: (i) What evidence do corresponding bins $h_k^a$ and $h_k^b$ provide for the hypothesis $H_0$ that the two histograms pertain to the same person? (ii) How significant is this evidence?

We answer the first question by estimating the conditional probability $P(H_0 \mid h_k^a, h_k^b)$ through Welch's test [75] of the hypothesis that the two populations with empirical parameters $(\mu_k^a, \sigma_k^a)$ and $(\mu_k^b, \sigma_k^b)$ have equal true means. This test extends the more popular Student's test to unequal variances. The $t$-values for bin $k$ are computed as

$$t(h_k^a, h_k^b) = \frac{\mu_k^a - \mu_k^b}{\sqrt{\dfrac{(\sigma_k^a)^2}{n^a} + \dfrac{(\sigma_k^b)^2}{n^b}}} \tag{7}$$

and then

$$P(h_k^a, h_k^b \mid H_0) = 2 \int_{-\infty}^{-|t(h_k^a, h_k^b)|} f\bigl(t; \nu(h_k^a, h_k^b)\bigr)\, dt, \tag{8}$$

where $f(t; \nu)$ is the probability density function of the $t$-distribution and the degrees of freedom $\nu$ can be estimated in closed form as:

$$\nu(h_k^a, h_k^b) = \frac{\left(\dfrac{(\sigma_k^a)^2}{n^a} + \dfrac{(\sigma_k^b)^2}{n^b}\right)^2}{\dfrac{(\sigma_k^a)^4}{(n^a)^2 (n^a - 1)} + \dfrac{(\sigma_k^b)^4}{(n^b)^2 (n^b - 1)}} \tag{9}$$

which we round to an integer. Assuming independence between the two histograms, Bayes's theorem yields

$$P(H_0 \mid h_k^a, h_k^b) = \underbrace{P(h_k^a, h_k^b \mid H_0)}_{\text{p-value or likelihood}}\ \frac{P(H_0)}{P(h_k^a)\, P(h_k^b)}. \tag{10}$$

We estimate $P(h_k^a)$ and $P(h_k^b)$ by computing histograms on a regular grid of body location and over the entire training set. The constant term $P(H_0)$ is ignored.
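
The per-bin Welch statistic and its degrees of freedom are simple to compute. The sketch below is a minimal illustration with hypothetical function names; it takes the squared quantities that appear in the standard error (passed as `var_a` and `var_b`) and the track lengths as inputs. The two-sided p-value additionally requires the cumulative distribution function of Student's t-distribution, which is not part of the Python standard library and is therefore left out here.

```python
import math

def welch_t(mu_a, var_a, n_a, mu_b, var_b, n_b):
    """Welch t statistic for one histogram bin (unequal variances)."""
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    return (mu_a - mu_b) / standard_error

def welch_dof(var_a, n_a, var_b, n_b):
    """Welch-Satterthwaite estimate of the degrees of freedom."""
    numerator = (var_a / n_a + var_b / n_b) ** 2
    denominator = (var_a ** 2 / (n_a ** 2 * (n_a - 1))
                   + var_b ** 2 / (n_b ** 2 * (n_b - 1)))
    return numerator / denominator

# Example: one bin observed over two tracks of 10 frames each.
t_value = welch_t(0.5, 0.04, 10, 0.3, 0.04, 10)
dof = round(welch_dof(0.04, 10, 0.04, 10))  # rounded to an integer, as in the text
```

When the two tracks have equal variances and equal lengths, the degrees of freedom reduce to $n^a + n^b - 2$, the value used by Student's test, which is a convenient sanity check.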

To reflect the greater significance of large bin values for histogram comparison (consideration (ii) above), we weigh each probability above with the smaller of $\mu_k^a$ and $\mu_k^b$ (a factor reminiscent of histogram intersection) to obtain the track similarity measure

$$s(p^a, p^b) = \sum_k P(H_0 \mid h_k^a, h_k^b)\, \min(\mu_k^a, \mu_k^b). \tag{11}$$


Trajectory Correlation. We extend our track similarity measure to trajectories by defining a track neighborhood by geodesic distance on the 3D body model, and comparing tracks only when they are close to each other in this distance. To prevent double-counting, we perform non-maximum suppression within track neighborhoods.

Specifically, define the distance between tracks $p^a$ and $p^b$ as the shortest geodesic distance $\delta$ (measured on the 3D model) between them,

$$d(p^a, p^b) = \min_{q^a \in Q^a,\ q^b \in Q^b} \delta(q^a, q^b) \tag{12}$$

and the neighborhood of track $p^a$ in trajectory $T^b$ as the set

$$U(p^a, T^b) = \{p^b \in T^b \mid d(p^a, p^b) < \rho\} \tag{13}$$

where $\rho$ is a threshold. We average similarities of $p^a$ over nearby tracks in $T^b$ to obtain a track-to-trajectory similarity

$$s(p^a, T^b) = \frac{1}{|U(p^a, T^b)|} \sum_{p^b \in U(p^a, T^b)} s(p^a, p^b). \tag{14}$$

We then perform non-maximum suppression by retaining a track $p^a$ iff it achieves the maximum similarity $s(p^a, T^b)$ in $U(p^a, T^a)$. Finally, we convert similarities to correlations

$$c(p^a, T^b) = s(p^a, T^b) - Z \tag{15}$$

where $Z$ is determined by cross-validation and denotes the indifference threshold between positive and negative evidence [66], and average the track-to-trajectory correlations to obtain a trajectory correlation

$$c(T^a, T^b) = \frac{\sum_{p^a \in T^a} c(p^a, T^b) + \sum_{p^b \in T^b} c(p^b, T^a)}{|T^a| + |T^b|}. \tag{16}$$

Joint rather than separate normalization in this equation prevents trajectories with a low number of tracks from biasing the correlation score.
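
The chain from track similarities to a trajectory correlation can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: pairwise similarities and geodesic distances are assumed to be precomputed and stored as nested lists indexed by (track in the first trajectory, track in the second trajectory), non-maximum suppression is omitted, and a track with an empty neighborhood is assumed to contribute zero similarity, a case the text does not specify. All names are hypothetical.

```python
def track_to_trajectory_similarity(sim_row, dist_row, rho):
    """Mean similarity of one track to the tracks within distance rho."""
    nearby = [s for s, d in zip(sim_row, dist_row) if d < rho]
    # Assumption: an empty neighborhood yields zero similarity.
    return sum(nearby) / len(nearby) if nearby else 0.0

def trajectory_correlation(sim, dist, rho, z):
    """Jointly normalized average of track-to-trajectory correlations.

    sim[i][j] and dist[i][j] relate track i of the first trajectory to
    track j of the second one. Non-maximum suppression is not applied.
    """
    n_a, n_b = len(sim), len(sim[0])
    # Transposed views give the direction from the second trajectory to the first.
    sim_t = [list(col) for col in zip(*sim)]
    dist_t = [list(col) for col in zip(*dist)]
    c_ab = [track_to_trajectory_similarity(sim[i], dist[i], rho) - z
            for i in range(n_a)]
    c_ba = [track_to_trajectory_similarity(sim_t[j], dist_t[j], rho) - z
            for j in range(n_b)]
    return (sum(c_ab) + sum(c_ba)) / (n_a + n_b)

# Toy example with two tracks per trajectory.
sim = [[4.0, 2.0], [1.0, 3.0]]
dist = [[0.1, 0.5], [0.2, 0.1]]
score = trajectory_correlation(sim, dist, rho=0.3, z=2.0)
```

Dividing the summed correlations by the total number of tracks, rather than averaging each direction separately, is what keeps a trajectory with very few tracks from dominating the score.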

With  this  definition,  the  3D  model  is  used  mainly  to  achieve  view-invariance  through  back-projection, 
while  all  the  track  appearance  information — averaged  in  the  image  plane  and  only  as  long  as  a  feature  tracker 
vouches  for  correct  correspondence — is  used  for  trajectory  comparisons.  Averaging  is  eventually  performed 
on  similarities  and  correlations,  rather  than  directly  on  appearance  information. 


2.2.5  Implementation  Details 


The  side  of  the  patches  contained  in  each  track  is  set  to  ^  of  the  bounding  box  height,  so  that  the  same  patch 
always  captures  the  same  amount  of  information  independently  of  the  scale.  At  each  frame  and  from  each 
patch we extract HSV histograms. The color space has been adaptively quantized into 100 bins according to the minimum variance criterion [45] with respect to our training data (see Section 2.4.2). On the same data, the parameters $Z = 10$ and the neighborhood threshold $\rho = 0.3$ m, as well as a radius of 0.2 m for non-maximum suppression, have also been cross-validated. For the optimization framework, the last layer sliding window is 2.5 min wide with a stride of half its size.


2.2.6  The  Single-Camera  Cascade 


We now describe a cascade that allows solving the graph partitioning problem defined in Section 2.2.1 approximately and efficiently over a temporal window several seconds long and for each camera separately. The longer the window, the longer the occlusions through which identities can be retained. Although we lose theoretical guarantees of optimality, we exploit the special structure of multi-person tracking to decompose the large BIP problem from Section 2.2.1 into manageable chunks that are unlikely to take us far from the optimal solution. Trajectories output by one such pipeline per camera are then aggregated into identities by one more solution to the basic BIP formulated in Section 2.2.1.

Figure 2: The proposed single-camera processing pipeline from detections to trajectories. One more inter-camera stage aggregates trajectories into identities.

Our cascade has two simpler phases divided into two stages each. The first phase partitions detections over short time horizons and results in tracklets, short sequences of detections that can be safely connected to each other based on both appearance and space-time affinities. The second phase reasons over the entire temporal window, and partitions tracklets into identities (a.k.a. trajectories). Each phase has in turn a first stage that performs a safe preliminary partitioning by simple means in order to reduce the size of the BIP in that phase, and a second stage that solves a BIP exactly to utilize all evidence optimally. The four stages are now described in turn.

Space-Time  Groups.  The  first  stage  divides  the  entire  video  sequence  into  1- second  intervals  and  uses 
hierarchical  agglomeration  [0  to  group  detections  within  each  interval  into  space-time  groups  (Figure |TJb)). 
Initially,  each  detection  is  in  a  separate  group.  The  algorithm  then  repeatedly  merges  the  pair  of  groups  that 
are closest to each other in space until $k_i$ space-time groups are formed for time interval $i$. We set $k_i$ to
one  half  of  the  expected  number  of  visible  people  in  the  given  time  interval,  estimated  as  the  ratio  between 
the  total  number  of  detections  and  the  number  of  frames  in  the  interval.  Because  of  the  conservative  choice 
of  ki ,  it  is  unlikely  that  observations  that  belong  together  end  up  in  different  groups.  Even  if  they  do,  one 
person  will  end  up  split  into  different  identities,  and  the  trajectory  stage,  described  later,  has  an  opportunity 
to  undo  the  split. 
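
A minimal sketch of the space-time grouping stage is given below. The text does not specify how the distance between two groups is measured, so centroid distance is used here as an assumption; rounding the group count and clamping it to at least one are likewise assumptions, and the function names are hypothetical. The quadratic search for the closest pair is kept for clarity rather than speed.

```python
import math

def num_groups(num_detections, num_frames):
    """Half the expected number of visible people in the interval."""
    expected_people = num_detections / num_frames
    return max(1, round(expected_people / 2))  # rounding is an assumption

def centroid(group):
    """Mean position of the detections in a group."""
    n = len(group)
    return tuple(sum(p[i] for p in group) / n for i in range(len(group[0])))

def space_time_groups(positions, k):
    """Agglomerate detections until only k groups remain."""
    groups = [[p] for p in positions]  # start with one group per detection
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = math.dist(centroid(groups[i]), centroid(groups[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i].extend(groups.pop(j))  # merge the closest pair (j > i)
    return groups

detections = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
groups = space_time_groups(detections, k=2)
```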

Tracklets. The second stage solves a BIP exactly for the observations of each space-time group, using the correlations (5) for evidence. The resulting partitions are called tracklets, and are at most one second long by construction. Solving exact BIPs on space-time groups ensures that both appearance and space-time evidence are used optimally within this short time horizon. Missing detections are recovered using interpolation or extrapolation, and tracklets shorter than 0.2 seconds are discarded as false positives.

Appearance Groups. The third stage reasons in appearance space and groups tracklets from the entire temporal window into appearance groups that will be processed independently of each other in the fourth stage. Non-parametric methods for discovering appearance groups [56] are a good fit for this stage. However, we use k-means and set the number k of clusters manually for simplicity.

The wholesale splitting of identities across different appearance groups is an irrecoverable error. However, appearance grouping is again conservative, in that two observations are grouped whenever they are even just loosely similar. The main assumptions in this stage are that a person's appearance can have only short-lived variations (e.g., partial occlusions or shadows) and that a person's appearance does not change suddenly and dramatically (e.g., a person putting on a raincoat while hidden behind an obstacle). The conservative nature of this stage typically prevents identity-split errors, and a few incorrectly assigned observations can be handled similarly to false positives and false negatives.

Trajectories.  The  last  stage  in  the  cascade  solves  a  separate  BIP  (exactly)  for  all  the  tracklets  in  each 
appearance  group  and  within  the  entire  temporal  window,  again  using  both  space-time  consistency  and 
appearance  similarity  as  evidence.  Missing  tracklets  for  each  trajectory  are  inferred  using  interpolation,  and 
very  short  trajectories  (shorter  than  2  seconds)  are  discarded  as  false  positives.  The  reduction  in  the  size  of 
the  BIPs  in  the  second  and  fourth  stage  of  our  cascade  allows  processing  long  temporal  windows  of  data  in 
real  time. 


2.2.7  Unlimited  Time  Horizon 


Typical  surveillance  video  streams  are  unbounded  in  length  and  require  real-time,  online  processing.  To  turn 
the  method  described  so  far  into  an  online  algorithm  we  employ  a  sliding  temporal  window.  The  temporal 
extent  of  the  window  is  set  ahead  of  time — and  depending  on  application — so  that  the  observations  in  it 
can  be  processed  in  real  time.  Video  frames  stream  in  continuously,  and  an  off-the-shelf  person  detector 
provides  the  needed  detections.  One-second-long  tracklets  are  continuously  formed  by  stages  1  and  2  of 
the  cascade,  and  added  to  the  input  data.  Once  a  window  is  processed  completely  as  explained  next,  it  is 
advanced  by  half  its  temporal  extent. 

All the tracklets that are at least partially contained in the first window are fed to the second phase of the cascade. Stages 3 and 4 form partial trajectories, and missing and spurious observations are handled as explained in Section 2.2.6. Partial trajectories are never undone, but they can be extended from data in
subsequent  windows.  In  windows  after  the  first,  the  elementary  input  observations  for  stage  3  are  all  the 
tracklets  and  all  the  partial  trajectories  whose  temporal  extents  overlap  the  current  window.  Except  for  this 
difference,  the  computations  are  the  same  as  in  the  first  window.  This  process  repeats  forever,  and  creates 
trajectories  that  an  additional  stage,  based  on  a  similar  sliding  window  and  one  more  binary  integer  program 
computation,  partitions  into  identities. 
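
The window schedule described above can be illustrated with a short Python sketch. It is only a schematic: the names are hypothetical, a finite `stream_end` is used so that the example terminates (a real stream is unbounded), and tracklets are reduced to their (start, end) time intervals.

```python
def sliding_windows(window_len, stream_end):
    """Yield (start, end) windows that advance by half the window length."""
    start = 0.0
    step = window_len / 2.0
    while start < stream_end:
        yield (start, start + window_len)
        start += step

def overlaps(interval, window):
    """True if a tracklet or partial trajectory is at least partially
    contained in the window."""
    t_start, t_end = interval
    w_start, w_end = window
    return t_start < w_end and t_end > w_start

# Example: 10-second windows over a 20-second stream of 1-second tracklets.
tracklets = [(float(t), float(t + 1)) for t in range(20)]
schedule = list(sliding_windows(10, 20))
first_inputs = [trk for trk in tracklets if overlaps(trk, schedule[0])]
```

Because consecutive windows overlap by half their extent, every tracklet is seen by at least two windows, which is what lets partial trajectories from one window be extended in the next.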


2.3  Performance  Measures 

While theory guides descriptor and algorithm design, evidence of good performance must in the end be empirical. Current performance measures such as Multiple Object Tracking Accuracy (MOTA) [18] apply mainly to single-camera tracking. Even there, they suffer from various weaknesses [70, 61] that can be traced back to the truth-to-result matching problem defined in the introduction: Which of several computed identities that match part of a given ground-truth trajectory gets how much credit for its match?

Current measures tailored to multi-camera tracking emphasize inter-camera errors rather than within-camera ones, since they focus on handover errors [54]. Specifically, fragments count true identities that correspond to multiple computed ones, either across (X-FRG) or within (R-FRG) cameras, and ID switches count the reverse error, that is, multiple true identities falsely linked by the computation, either across (X-IDS) or within (R-IDS) cameras. These measures suffer from the truth-to-result matching problem as well.

A recent measure aimed at identity management, MCTA [29], combines true positive accuracy with resilience to identity switches within and between cameras. Identity switches between cameras are counted when there is an identity change between the last frame of the true trajectory in one camera and the first frame of the same trajectory in a different camera. This is clearly not the best mapping choice in case of trajectory fragmentation, where the correct identity might be tracked correctly for a large part of the true trajectory and lost only in late frames. In addition, when a person returns to the same camera but the mapping has changed, this measure counts the error as a single-camera ID switch rather than as a between-camera ID switch.

To address these issues, we first solve the truth-to-result matching problem, and then build standard measures such as precision, recall, and $F_1$-score on top of that solution.

2.3.1  Matching  True  and  Computed  Identities 

Our solution is based on the following remarks: (1) A correct match (no fragmentation or ID switches) between true identities and computed identities is one-to-one. (2) A fair truth-to-result matching is most favorable to the algorithm. (3) A fair penalty for an error of either type (ID switch or fragmentation) is the number of mis-assigned frames. Together, these considerations suggest computing a bipartite matching between computed and true identities with a minimal number of mis-assigned frames.

To this end, we construct a bipartite graph $G = (V_T, V_C, E)$ as follows. Vertex set $V_T$ has one "regular" node $\tau$ for each true trajectory and one "false positive" node $f_\gamma^+$ for each computed trajectory $\gamma$. Vertex set $V_C$ has one "regular" node $\gamma$ for each computed trajectory and one "false negative" node $f_\tau^-$ for each true trajectory $\tau$. Two regular nodes are connected with an edge $e \in E$ if they overlap in time. Every regular true node $\tau$ is connected to its corresponding $f_\tau^-$, and every regular computed node $\gamma$ is connected to its corresponding $f_\gamma^+$.

The cost on an edge $(\tau, \gamma) \in E$ tallies the number of false negative and false positive image frames that would be incurred if that match were chosen. Specifically, let $\tau(t)$ be the sequence of detections for true trajectory $\tau$, one detection for each image frame $t$ in the set $\mathcal{T}_\tau$, and define $\gamma(t)$ for $t \in \mathcal{T}_\gamma$ similarly for computed trajectories. The two simultaneous detections $\tau(t)$ and $\gamma(t)$ are a mismatch if they do not overlap in space, and we write

$$m(\tau, \gamma, t, \Delta) = 1. \tag{17}$$

When both $\tau$ and $\gamma$ are regular nodes, spatial overlap between two detections can be measured either in the image plane or on the reference ground plane in the world. In the first case, we declare a mismatch when the area of the intersection of the two detection boxes is less than $\Delta$ (with $0 < \Delta < 1$) times the area of the union of the two boxes. On the ground plane, we declare a mismatch when the positions of the two detections are more than $\Delta = 1$ meter apart. If there is no mismatch, we write $m(\tau, \gamma, t, \Delta) = 0$. When either $\tau$ or $\gamma$ is an irregular node ($f_\tau^-$ or $f_\gamma^+$), any detections in the other trajectory are mismatches. When both $\tau$ and $\gamma$ are irregular, $m$ is undefined.

With this definition, the cost on edge $(\tau, \gamma) \in E$ is defined as follows:

$$c(\tau, \gamma, \Delta) = \underbrace{\sum_{t \in \mathcal{T}_\tau} m(\tau, \gamma, t, \Delta)}_{\text{False Negatives}} + \underbrace{\sum_{t \in \mathcal{T}_\gamma} m(\tau, \gamma, t, \Delta)}_{\text{False Positives}} \tag{18}$$

We define costs in terms of binary mismatches, rather than, say, Euclidean distances, so that a mismatch between regular positions has the same cost as a mismatch between a regular position and an irregular one. Matching two irregular trajectories incurs zero cost because they are empty.

A minimum-cost solution to this bipartite matching problem determines a one-to-one matching (consistently with remark 1 above) that minimizes the cumulative false positive and false negative errors (remark 2), and the overall cost is the number of mis-assigned frames (remark 3). Every $(\tau, \gamma)$ match is a True Positive (TP). Every $(f_\gamma^+, \gamma)$ match is a False Positive (FP). Every $(\tau, f_\tau^-)$ match is a False Negative (FN). Every $(f_\gamma^+, f_\tau^-)$ match is a True Negative (TN).
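
The matching can be illustrated on a toy problem with the Python sketch below. Several choices here are assumptions made for the sake of a self-contained example: trajectories are dictionaries that map a frame index to a ground-plane position, a frame in which only one of the two trajectories has a detection is counted as a mismatch, and the minimum-cost matching is found by brute force over permutations, which is only viable for a handful of trajectories (a practical implementation would use a polynomial-time assignment solver such as the Hungarian algorithm). All names are hypothetical.

```python
import math
from itertools import permutations

def mismatch(tau, gamma, t, delta=1.0):
    """1 if the detections at frame t do not match, else 0."""
    if t not in tau or t not in gamma:
        return 1  # assumption: a missing counterpart is a mismatch
    return int(math.dist(tau[t], gamma[t]) > delta)

def edge_cost(tau, gamma, delta=1.0):
    """(false negatives, false positives) incurred by matching tau to gamma."""
    fn = sum(mismatch(tau, gamma, t, delta) for t in tau)
    fp = sum(mismatch(tau, gamma, t, delta) for t in gamma)
    return fn, fp

def match_identities(true_trajs, comp_trajs, delta=1.0):
    """Return frame counts (TP, FP, FN) of the minimum-cost matching."""
    nt, nc = len(true_trajs), len(comp_trajs)
    n = nt + nc
    # Rows: true trajectories, then one false-positive node per computed one.
    # Columns: computed trajectories, then one false-negative node per true one.
    cost = [[None] * n for _ in range(n)]  # None marks a missing edge
    for i, tau in enumerate(true_trajs):
        for j, gamma in enumerate(comp_trajs):
            if tau.keys() & gamma.keys():          # overlap in time
                cost[i][j] = edge_cost(tau, gamma, delta)
        cost[i][nc + i] = (len(tau), 0)            # tau left unmatched
    for j, gamma in enumerate(comp_trajs):
        cost[nt + j][j] = (0, len(gamma))          # gamma left unmatched
        for i in range(nt):
            cost[nt + j][nc + i] = (0, 0)          # two empty trajectories
    best = None
    for perm in permutations(range(n)):            # brute force, tiny inputs only
        pairs = [cost[row][col] for row, col in enumerate(perm)]
        if any(p is None for p in pairs):
            continue
        fn = sum(p[0] for p in pairs)
        fp = sum(p[1] for p in pairs)
        if best is None or fn + fp < sum(best):
            best = (fn, fp)
    fn, fp = best
    tp = sum(len(tau) for tau in true_trajs) - fn
    return tp, fp, fn

# One 16-frame true identity fragmented into two 8-frame computed identities.
truth = [{t: (0.0, 0.0) for t in range(16)}]
computed = [{t: (0.0, 0.0) for t in range(8)},
            {t: (0.0, 0.0) for t in range(8, 16)}]
counts = match_identities(truth, computed)  # -> (8, 8, 8)
```

In this example one fragment is matched to the true identity and the other is left as a false positive, so half of the true frames and half of the computed frames are mis-assigned.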

2.3.2 Performance Scores

We use the TP, FP, FN counts from bipartite matching to compute precision $P$ and recall $R$:

$$P = \frac{TP}{TP + FP} = \frac{TP}{C}, \qquad R = \frac{TP}{TP + FN} = \frac{TP}{T} \tag{19}$$

where $C = TP + FP$ is the number of computed detections and $T = TP + FN$ is the number of true detections. Precision is the fraction of computed detections that are correct. Recall is the fraction of true detections that are computed. The $F_1$ score is a single figure that balances precision and recall and is the ratio of correctly identified detections to the average number of true and computed detections:

$$F_1 = \frac{2PR}{P + R} = \frac{TP}{\frac{T + C}{2}}. \tag{20}$$

Precision, recall, and $F_1$-score are based on a mapping that is computed jointly over all trajectories by optimizing a single objective function and that treats within- and across-camera matches uniformly. Their definition is conceptually simpler than MOTA and MCTA, as it entails one global optimization rather than one optimization problem per frame. Precision and recall shed light on tracking trade-offs, while the $F_1$ score allows ranking all trackers on a single scale.
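
Given the frame counts produced by the matching, the three scores follow directly. The short sketch below uses a hypothetical function name and assumes that at least one true and one computed detection exist, so that no denominator is zero.

```python
def identity_scores(tp, fp, fn):
    """Precision, recall, and F1 from matched frame counts."""
    precision = tp / (tp + fp)       # TP / C
    recall = tp / (tp + fn)          # TP / T
    f1 = 2 * tp / (2 * tp + fp + fn) # TP / ((T + C) / 2)
    return precision, recall, f1

# 16 true frames and 16 computed frames, half of them correctly identified.
scores = identity_scores(tp=8, fp=8, fn=8)  # -> (0.5, 0.5, 0.5)
```

The expression used for `f1` is the same quantity as $2PR/(P+R)$, rewritten in terms of the raw counts to avoid a second division.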

The bipartite matching figures also reveal the generalization cost incurred by solving the tracking problem simultaneously for all cameras rather than separately for each camera. Specifically, let

$$E_M = FP_M + FN_M \quad \text{and} \quad E_S = FP_S + FN_S \tag{21}$$

be the total number of errors for the multi-camera solution and for the solution obtained for separate cameras. Then,

$$E_M \geq E_S \tag{22}$$

because the multi-camera solution is feasible for separate cameras, but not necessarily vice versa. So the difference $E_M - E_S \geq 0$ can be interpreted as the fitting cost that has to be paid to make single-camera solutions hold across all cameras. Simple manipulation also shows that the $F_1$ score for the multi-camera solution is upper bounded by that for separate camera solutions:

$$F_1^M \leq F_1^S. \tag{23}$$

Our evaluation methodology addresses even the most glaring issues with MOTA and MCTA. For MOTA, consider a single-camera true trajectory with, say, 16 frames, and suppose that the system fragments this identity into two, with eight consecutive frames each. Both MOTA and MCTA give a score of 93% to this outcome that seems half wrong. MCTA gives a correct score of 50% if the true trajectory is split in half over two cameras, but then it is puzzling that it scores what is effectively the same error so differently for different numbers of cameras. In contrast, the $F_1$ score based on our evaluation paradigm assigns 50% to all these cases, a more plausible and fully consistent score.

2.4 Experiments


Despite the large body of literature tackling identity management, the lack of public and realistic data sets, the unavailability of code for other methods, and inconsistencies in evaluation protocols make comparative performance evaluation difficult. In this section, we first evaluate our methods on standard benchmark data and with standard performance evaluation measures to show that our methods improve on the state of the art in several aspects. These comparisons are made on single-camera data sets, to which most of the published results apply. After that, we describe our own, new multi-camera data set and present preliminary results on it. We cannot compare these results directly with the literature, since our data set is not yet published. Instead, we make a few comparisons on smaller, existing data sets.


2.4.1  Single-Camera  Experiments 

We  evaluate  our  algorithm  on  three  standard  single-camera  data  sets  for  multi-person  tracking:  PETS2009 
|42l.  Town  Center  ff5l  and  Parking  Lot  |QQ|.  We  used  the  PETS2009-S2L1  View  1  sequence,  which  has 
a  resolution  of  768  x  576  pixels  and  consists  of  798  frames  at  7  fps  (117  seconds).  The  scene  is  not 
heavily  crowded,  but  the  low  predictability  in  people’s  motion  and  a  few  long  occlusions  behind  a  lamp  post 
make the sequence challenging. The Town Center sequence is more challenging because it is longer, more
crowded,  and  has  longer  occlusions.  Occlusions  in  this  sequence  are  mainly  caused  by  people  walking  very 
close  to  each  other.  The  sequence  has  a  resolution  of  1920  x  1080  pixels  and  consists  of  4500  frames  at  25 
fps  (180  seconds).  The  Parking  Lot  sequence  consists  of  998  frames  at  30  fps  (33.26  seconds)  and  has  a 
resolution  of  1920  x  1080  pixels.  This  sequence  is  challenging  because  it  is  filmed  from  an  oblique  angle 
and  several  people  have  similar  appearance.  Also,  people  walk  close  to  each  other  in  parallel  causing  long 
occlusions,  both  partial  and  full. 

We  use  the  standard  Multiple  Object  Tracking  Accuracy  (MOTA)  score  OH  to  evaluate  the  performance 
of  our  algorithm.  This  score  combines  the  number  of  false  positives  fp(t),  false  negatives  fn(t ),  and  identity 
switches  id(t)  over  all  frame  indices  t  as  follows: 


$$\text{MOTA} = 1 - \frac{\sum_t \bigl(fp(t) + fn(t) + id(t)\bigr)}{\sum_t g(t)} \tag{24}$$


where  g(t)  is  the  ground-truth  number  of  people  in  frame  t.  MOTA  is  widely  accepted  in  the  field  as  one  of 
the  principal  indicators  of  a  tracker’s  performance. 
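
For concreteness, the MOTA formula can be evaluated from per-frame counts as in the following sketch. The function name and the list-based inputs are hypothetical conveniences, and the example reuses the 16-frame fragmentation scenario discussed in Section 2.3.2, under the assumption that the fragmentation registers as a single identity switch.

```python
def mota(fp, fn, ids, gt):
    """MOTA from per-frame false positives, false negatives, identity
    switches, and ground-truth counts (one list entry per frame)."""
    errors = sum(fp) + sum(fn) + sum(ids)
    return 1.0 - errors / sum(gt)

# One person visible for 16 frames, every frame detected, but the computed
# identity changes once, at the ninth frame.
frames = 16
score = mota(fp=[0] * frames,
             fn=[0] * frames,
             ids=[0] * 8 + [1] + [0] * 7,
             gt=[1] * frames)  # -> 0.9375
```

A single switch costs only one error unit regardless of how many frames end up under the wrong identity, which is why this outcome scores so highly under MOTA.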

In Table 2 we present results for all sequences. We outperform state-of-the-art methods in MOTA and
identity  switches.  For  a  fair  comparison,  we  use  the  detections  used  in  previous  work  (T),  courtesy  of  the 
authors.  All  evaluations  are  done  using  the  CLEAR  MOT  evaluation  script  J31  and  we  use  the  standard  1 
meter  acceptance  threshold. 

In  the  PETS2009  sequence  we  use  a  long  temporal  window  of  20  seconds  and  one  appearance  group 
since  the  scene  is  not  crowded.  We  allow  tracklets  to  be  at  most  10  frames  in  this  sequence  due  to  its  low 
frame  rate.  The  total  running  time  of  our  method,  not  accounting  for  person  detection,  is  38  seconds.  In 
the  Town  Center  sequence  we  use  a  temporal  window  of  12  seconds  and  5  appearance  groups  because  the 
sequence  is  more  crowded.  Tracklets  have  lengths  of  at  most  20  frames.  The  total  running  time  on  this 
sequence  is  176  seconds,  120  of  which  were  spent  finding  all  tracklets.  In  the  Parking  Lot  sequence  we  use 
a  temporal  window  of  6  seconds  and  tracklets  are  at  most  20  frames  long.  We  used  one  appearance  group  in 
this  sequence  since  it  is  short.  The  total  running  time  on  this  sequence  is  34  seconds. 

PETS2009
Method             MOTA          IDsw
Berclaz            80.00         28
Shitrit [12]       81.46         19
Andriyenko [3]     81.84         15
Henriques [46]     87.95         10
Izadinia [47]      90.70         -
Zamir              91.50         8
Ours               93.34         1

Town Center
Method             MOTA          IDsw
Benfold            64.9          259
Zhang [79]         65.7          114
Leal-Taixe [55]    67.3          86
Izadinia [47]      75.70         -
Zamir              75.59         -
McLaughlin [60]    76.46         -
Ours               78.43±0.29    68

Parking Lot
Method             MOTA          IDsw
Izadinia [47]      88.90         -
Zamir              92.27         1
Ours               94.20         1

Table 2: Multi Object Tracking Accuracy (MOTA) and ID switches on three standard data sets. MOTA variance for the Town Center sequence was caused by differences in appearance groups in different runs, resulting from the randomness of the k-means clustering algorithm.


Window Length, Accuracy, and Runtime. Figures 3(a) and 3(b) show the dependency of tracking accuracy and running time on the length of the sliding window for the Town Center sequence with 10 appearance groups. We ran several experiments on this sequence by progressively elongating the temporal window.

Figure 3(a) shows that after the temporal window length is increased beyond 3 seconds, which corresponds to the typical occlusion length in the scene, there is no significant improvement in the quality of the solution. The variations in the graph are caused by differences in the appearance groups that the k-means algorithm finds in each window. The slight decrease in the scores for windows longer than 19 seconds occurs because the parameter $\beta$ in Equation (3) also influences how large partitions can grow in time.

Figure 3(b) shows how the sliding window length affects the running time. Appearance grouping allows us to achieve an unprecedented temporal window length for real-time computation.

Appearance Groups, Accuracy, and Runtime. Figures 3(c) and 3(d) show the dependency of tracking accuracy and running time on the number of appearance groups for the Town Center sequence with a temporal window of 8 seconds.

Figure 3(c) shows that even a moderately high number of appearance groups, around 20, has negligible harmful effects on the accuracy of the tracker. When the number of appearance groups is increased further, the accuracy measure starts to decay because identities are split into separate groups. The fluctuations in the graph are again caused by the k-means algorithm, which over-clusters in windows that contain few tracklets.

Figure 3(d) shows that the overall running time is greatly reduced when we go from 1 to about 5 appearance groups, while the MOTA score only drops from 79% to 78.4% (Figure 3(c)). Increasing the number of appearance groups further yields marginal reductions in running time. The slight increase in total runtime for more than 20 appearance groups is caused by the k-means algorithm, whose complexity increases with k.

Figure 3: MOTA scores (a, c) and running times (b, d) as functions of the length of the sliding window (a,
b)  and  the  number  of  appearance  groups  (c,  d)  for  the  Town  Center  sequence.  Solver  time  indicates  how 
much  time  was  spent  for  assembling  and  solving  all  the  Binary  Integer  Programs.  The  total  running  time 
also  includes  the  time  for  computing  correlations,  but  does  not  account  for  person  detection.  Figures  (a)  and 
(b)  are  for  ten  appearance  groups,  and  Figures  (c)  and  (d)  are  for  8-second  sliding  windows.  Best  viewed  on 
screen. 


Approximate  and  Exact  Graph  Partitioning  Solvers  We  explore  the  trade-off  between  accuracy  and 
runtime  for  different  combinations  of  solvers  for  graph  partitioning.  We  demonstrate  that  approximating  the 
solution  of  multi-person  tracking  by  piecing  together  exact  solutions  of  small  subproblems  is  qualitatively 
better  than  algorithms  with  no  optimality  guarantees,  while  still  achieving  real-time  performance. 

Three algorithms for graph partitioning have been recently proposed in the literature [6], namely
Expand-and-Explore, Swap-and-Explore, and Adaptive Label Iterative Conditional Modes (AL-ICM). We
use the last of these in our experiment because of its speed and ability to scale to large problems. Given a
labeling vector $L \in \{1, 2, \ldots\}^n$, the algorithm assigns a label $l_u$ to each observation $u$ so as to
minimize the following energy function:

$$E(L) = \sum_{uv} w_{uv}\, \mathbb{1}[l_u \neq l_v] \qquad (25)$$

where $\mathbb{1}[P]$ is 1 when $P$ is true and 0 otherwise. This energy is lowered when observations supported
by negative correlation are labeled differently and when observations supported by positive correlation are

                | PETS 2009                | Town Center                      | Parking Lot
Method          | MOTA   Runtime  Solver   | MOTA          Runtime   Solver   | MOTA   Runtime  Solver
----------------+--------------------------+----------------------------------+------------------------
Izadinia        | 90.70  -        -        | 75.70         -         -        | 88.90  -        -
Zamir           | 91.50  -        -        | 75.59         -         -        | 92.27  -        -
AL-ICM          | 91.34  9.33     0.31     | 77.78 ± .35   107.54    3.52     | 93.33  15.69    0.39
AL-ICM-NoGroup  | 92.20  10.68    0.45     | 78.46         284.73    18.86    | 93.92  28.00    1.11
BIP             | 93.18  17.06    8.06     | 78.43 ± .29   177.17    73.10    | 94.20  33.59    20.40
BIP-NoGroup     | 93.18  27.43    16.87    | 78.87         25725.82  23444.58 | 94.20  334.45   307.39

Table 3: Different combinations of solvers evaluated on three standard data sets. The length of each sequence
is 117, 180 and 33.26 seconds respectively. Solver time indicates how many seconds were spent for solving
graph partitioning problems in each sequence. The total running time also includes the time for computing
correlations, but does not account for person detection.


labeled identically. This discrete energy minimization formulation has the advantage that the labeling vector
$L$ consists of $n$ variables, whereas the co-identity matrix $X$ in our formulation consists of $n^2$ variables.
This allows AL-ICM to scale to $n > 100{,}000$ observations.

AL-ICM is a greedy search algorithm. In each iteration, every variable is assigned the label that
minimizes the energy, conditioned on the current labels of the other variables. While ICM requires a fixed
number of labels [19], AL-ICM handles a varying number of labels as follows: conditioned on the current
labeling, each observation is assigned to the most rewarding partition, or to a new partition if it is penalized
by all current partitions. The algorithm terminates either when the energy cannot be reduced further or when
a predefined number of iterations is reached.
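
The greedy update described above can be illustrated with a minimal Python sketch. This is not the implementation used in the experiments: the function names `energy` and `al_icm`, the singleton initialization, the sweep order, and the strict-improvement rule are all assumptions made for illustration. The reward of a partition for observation u is taken to be the sum of the correlations between u and the current members of that partition, so that moving u to the most rewarding partition lowers the energy of Equation (25).

```python
import numpy as np

def energy(W, labels):
    """E(L): sum of w_uv over pairs (u < v) that carry different labels."""
    differ = labels[:, None] != labels[None, :]
    return float(np.triu(W * differ, k=1).sum())

def al_icm(W, max_iters=100):
    """Greedy partitioning of n observations given a symmetric correlation matrix W."""
    n = W.shape[0]
    labels = np.arange(n)  # assumption: every observation starts in its own partition
    for _ in range(max_iters):
        changed = False
        for u in range(n):
            others = np.arange(n) != u
            # Reward of a partition: total correlation between u and its current members.
            rewards = {int(lab): float(W[u, others & (labels == lab)].sum())
                       for lab in np.unique(labels[others])}
            current = rewards.get(int(labels[u]), 0.0)  # 0 when u is alone
            best_label, best = max(rewards.items(), key=lambda kv: kv[1],
                                   default=(None, 0.0))
            if best <= 0.0:
                # Not rewarded by any partition: a new partition has reward 0.
                best_label, best = int(labels.max()) + 1, 0.0
            if best > current:  # move only on strict improvement, so the energy drops
                labels[u] = best_label
                changed = True
        if not changed:
            break
    return labels

# Toy example: observations 0-2 support each other, 3-4 support each other,
# and the two groups are negatively correlated.
W = np.array([[ 0,  1,  1, -1, -1],
              [ 1,  0,  1, -1, -1],
              [ 1,  1,  0, -1, -1],
              [-1, -1, -1,  0,  1],
              [-1, -1, -1,  1,  0]], dtype=float)
labels = al_icm(W)
```

On this toy matrix the sketch separates the two groups. Because each move strictly lowers the energy, the loop must stop, but only at a local minimum, which is the price of greediness noted in the text.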

We construct two methods based on this algorithm. Method AL-ICM uses the greedy algorithm in
stages 2 and 4 of the cascade, and space-time and appearance grouping in stages 1 and 3. Method
AL-ICM-NoGroup uses the greedy algorithm but no grouping, thus only stages 2 and 4 of the full cascade.

We  refer  to  our  full  algorithm  as  BIP,  and  we  compare  it  also  to  a  method  we  call  BIP-NoGroup,  that 
is,  stages  2  and  4  of  the  cascade  without  space-time  and  appearance  grouping.  Performance  metrics  for  all 
methods on three sequences are presented in Table 3.

Accuracy. All our methods consistently outperform the state of the art. Even method AL-ICM is
on par with, if not better than, the state of the art, although it can be penalized by mistakes due to grouping
heuristics and the suboptimal greedy algorithm. The differences in accuracy between our methods that use
grouping and their corresponding versions without grouping are minimal. This confirms that stages 1 and 3
of our cascade can be used in practice, safely and with negligible harmful effects. It is also worth noting
that piecing together optimal solutions of small problems is superior to combining approximate solutions
of small problems, which is common in the literature: Both BIP and BIP-NoGroup perform better than
AL-ICM and AL-ICM-NoGroup, respectively.

Runtime. It is not surprising that the AL-ICM algorithm is much faster than the BIP solver.
AL-ICM is a greedy algorithm and does not require assembling and solving a BIP with a quadratic number
of variables and a combinatorial number of constraints. We note that the use of grouping heuristics is
crucial for improving runtime performance; methods that do not use heuristics need to compute large and
full correlation matrices. While the best time performance is that of AL-ICM, our BIP method is also fast
enough to work in real-time at 25 fps.

Trade-offs. Considering the trade-offs between accuracy and runtime, the BIP approach is appropriate
when accuracy is important and the scene has medium crowd density. The AL-ICM variant is more
appropriate for time-critical applications or more crowded scenes, but comes at a cost in terms of accuracy. In

        MOTA↑   MOTP↑   PREC↑   REC↑    IDS↓
KSP     65.56   64.70   84.26   80.89   146
Ours    70.10   60.50   89.02   80.46   281

Table 4: Comparison of our clique-based optimization approach against KSP’s path-based formulation on
the ISSIA data set. Arrows up (down) mean that greater (smaller) is better.


the  absence  of  heuristics,  which  are  not  useful  when  the  scene  is  crowded  or  all  appearances  look  the  same, 
AL-ICM-NoGroup  is  the  only  method  from  the  above  set  that  can  be  used  to  meet  weaker  time  constraints. 

2.4.2  Multi-Camera Experiments

A New Data Set  Our new Duke Chapel data set is recorded with 8 video cameras pointed at a large outdoor
area  of  the  Duke  University  campus.  Video  is  recorded  at  1080p  resolution  and  60  frames  per  second.  All 
cameras  are  calibrated  and  synchronized  to  within  1  second.  Two  camera  pairs  have  small  overlapping  fields 
of  view,  while  the  fields  of  view  of  other  cameras  are  disjoint.  The  data  set  is  1  hour  25  minutes  long  for 
each  camera  and  has  more  than  1  million  frames. 

Exterior  calibration  information  is  available  for  each  camera.  Combined  with  manual  annotation  of 
every  detection  in  every  frame  from  every  camera,  this  information  allows  providing  ground  truth  in  the 
form  of  a  trajectory  for  each  identity  and  camera  on  the  world  ground  plane.  Image  bounding  boxes  for  each 
person  in  every  frame  are  also  available  and  have  been  generated  semi-automatically. 

The data set includes 6791 trajectories for 2834 different identities (distinct people), for an average of
about 2.4 trajectories per identity. Some identities have up to 7 trajectories, meaning that the corresponding people
appear  and  reappear  up  to  7  times  as  they  walk  around  campus.  The  time  period  over  which  the  cameras 
were  on  included  intervals  between  scheduled  classes,  in  which  pedestrian  traffic  is  heavy.  As  a  result,  there 
are  between  zero  and  54  people  per  frame.  The  total  number  of  hand-overs — transitions  of  the  same  person 
from  one  camera  to  the  next — is  4159,  while  the  maximum  number  of  people  simultaneously  traversing 
blind  spots  is  50. 

The  first  5  minutes  of  video  from  each  of  the  cameras  have  been  set  aside  as  a  training  and  validation  set 
and  can  be  used  to  manually  set  or  automatically  estimate  algorithm  parameters. 


Evaluation of our Trajectory Correlation Measure  Our first multi-camera experiment evaluates the
methods we described in Section 2.2.4 to compare trajectories. We compare our method to a widespread
algorithm that uses K-Shortest Paths (KSP) [13, 16]. The experiment is conducted on the ISSIA data set @0],
which is a 3-minute soccer scene comprising 25 targets (11 from each team and 3 referees), recorded by 6
cameras with different levels of overlap between views and no blind spots. This setting lets us test the ability
of our method to take advantage of redundant data from overlapping views, as opposed to KSP, which is
specifically designed to work under the hypothesis of no blind spots. Moreover, to emphasize the merit of
our trajectory correlation measure independently from how trajectories are described, we employ a simple
HSV histogram (16, 4 and 4 bins respectively) to describe detections. For a fair comparison, both methods
are run on the same input detections f4H| and on the whole length of each sequence. KSP parameters have
been set according to the authors’ guidelines to obtain best results.

Results reported in Table 4 show that our method, despite not being specifically designed to work for
overlapping views, is still able to exceed the state of the art in both accuracy and precision. KSP’s lower
number of identity switches and fragmentations is due to its additional entry/exit constraints: No soccer

Figure 4: Retrieval results on a subset of our data set (Sec. 2.4.2) based on appearance only. Existing
methods fail to exploit the availability of more frames to increase the descriptor robustness.


player  is  allowed  to  enter  or  exit  the  field  during  the  sequence,  forcing  trajectories  to  always  exist.  We  do 
not  take  advantage  of  these  constraints. 

In contrast to the previous experiment, we now evaluate our trajectory descriptors in isolation, without
global matching optimization. To this end, we extracted 100 pairs of trajectories, two for each of 100
identities, from our Duke Chapel data set. These trajectories come with bounding boxes in each frame, and
are automatically extracted by employing the first two layers of our identity management pipeline. For
comparison, we reimplemented a previous method, Sarc3D [8], and tested it on our data. All the parameters
in Sarc3D have been set to achieve best retrieval accuracy. This method shares some similarities with ours
in that it addresses view invariance through the same 3D model and was designed to work with multiple
(not necessarily continuous) snapshots. We also compare against a simple baseline descriptor, a histogram
of HSV features (with 16, 4 and 4 bins in three different trials) of the pixels contained in the top, middle,
and bottom third of each detection box, averaged over the frames of a trajectory.
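
As an illustration of the baseline descriptor, the following Python sketch builds per-strip HSV histograms and averages them over a trajectory. It is a hedged reading of the description above, not the code used in the experiments: it assumes that the 16, 4 and 4 bins apply to the H, S and V channels respectively, that each detection crop is already an HSV array with values in [0, 1], and that each strip histogram is normalized to sum to one. The function names are hypothetical.

```python
import numpy as np

BINS = (16, 4, 4)  # assumed bin counts for the H, S and V channels

def strip_histogram(hsv):
    """Concatenated, normalized per-channel histograms of one horizontal strip."""
    parts = [np.histogram(hsv[..., c], bins=b, range=(0.0, 1.0))[0]
             for c, b in enumerate(BINS)]
    hist = np.concatenate(parts).astype(float)
    return hist / max(hist.sum(), 1.0)  # guard against empty strips

def detection_descriptor(hsv_box):
    """Histograms of the top, middle and bottom third of a detection box."""
    thirds = np.array_split(hsv_box, 3, axis=0)
    return np.concatenate([strip_histogram(t) for t in thirds])

def trajectory_descriptor(hsv_boxes):
    """Average of the per-detection descriptors over the frames of a trajectory."""
    return np.mean([detection_descriptor(b) for b in hsv_boxes], axis=0)

# Two synthetic detections of different sizes stand in for real crops.
rng = np.random.default_rng(0)
boxes = [rng.random((30, 12, 3)), rng.random((45, 20, 3))]
descriptor = trajectory_descriptor(boxes)
```

The descriptor has 3 x (16 + 4 + 4) = 72 entries regardless of box size. Plain averaging is the aggregation step that, according to the experiment below, stops helping once more than a few frames are used.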

The ordinate of the plot in Figure 4 shows the number of correct matches that are retrieved in the top
k ranked positions, where k is shown on the abscissa. Our method retrieves 30% more correct matches as
the top-ranked one when compared with other methods, and retrieves 90% of all correct matches in the top
three positions. For the other methods we also run experiments by increasing the number of regularly sampled
frames along the trajectory, as noted in the legend of the plot. After a few frames, performance saturates
and then worsens over time, as a result of the lack of a proper way to aggregate and compare redundant
descriptors. Sarc3D and the HSV baseline perform similarly to each other, with ours a clear winner.
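
The retrieval curve just described can be computed from a matrix of descriptor distances. The sketch below is illustrative only: it assumes one gallery trajectory per query, with the correct match for query i stored at gallery index i, and the function name `cumulative_matches` is hypothetical.

```python
import numpy as np

def cumulative_matches(dist):
    """counts[k-1] = number of queries whose correct match is among the top k.

    dist[i, j] is the distance between query i and gallery item j; the correct
    match for query i is assumed to be gallery item i.
    """
    order = np.argsort(dist, axis=1)                 # gallery items, best first
    correct = np.arange(dist.shape[0])[:, None]
    ranks = np.argmax(order == correct, axis=1)      # 0-based rank of the correct match
    return np.cumsum(np.bincount(ranks, minlength=dist.shape[1]))

dist = np.array([[0.1, 0.5, 0.9],
                 [0.2, 0.3, 0.1],
                 [0.7, 0.6, 0.5]])
counts = cumulative_matches(dist)
```

In this toy example, queries 0 and 2 are matched correctly at rank one and query 1 only at rank three, so the curve reads 2, 2, 3.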

A First Large-Scale Multi-Camera Experiment  Table 5 summarizes standard performance figures, as
well as our own new measures, on the Duke Chapel data set. This table is meant as a set of reference figures

                 | CLEAR MOT Metrics                                            | Proposed Metrics
                 | FP↓    FN↓    IDS↓  FRG↓  MOTA↑  MOTP↑  GT    MT↑   ML↓     | P↑     R↑     F1↑
-----------------+--------------------------------------------------------------+---------------------
Camera 1         | 9.70   52.90  178   366   37.36  67.57  1175  105   128     | 79.17  44.97  57.36
Camera 2         | 21.48  29.19  866   1929  49.17  61.70  1106  416   50      | 69.11  63.78  66.34
Camera 3         | 7.04   39.39  134   336   53.50  63.57  501   229   42      | 81.46  55.11  65.74
Camera 4         | 10.61  33.42  107   403   55.92  66.51  390   128   21      | 79.23  61.16  69.03
Camera 5         | 3.48   23.38  162   292   73.09  70.52  644   396   33      | 84.86  67.97  75.48
Camera 6         | 38.62  48.21  1426  3370  12.94  48.62  1043  207   91      | 48.35  43.71  45.91
Camera 7         | 8.28   29.57  296   675   62.03  60.73  678   373   53      | 85.23  67.08  75.07
Camera 8         | 1.29   61.69  270   365   36.98  69.07  1254  369   236     | 90.54  35.86  51.37
Over all cameras | 14.38  43.85  3439  7736  41.66  63.54  6791  2223  654     | 72.25  50.96  59.77
Multi-Camera     | Our method with proposed 3D descriptor                       | 48.90  34.09  40.17
Multi-Camera     | Our method with 2D HSV feature                               | 48.61  33.88  39.93

Table 5: Single-camera (Camera 1 through 8 and their aggregate) and multi-camera (last two rows) results on
the Duke Chapel data set. For each camera we report both standard Multi-Target tracking measures as well as
our new measure. FP, FN, P, R and F1 are percentage values. Arrows up (down) mean that greater (smaller)
is better.


to  which  future  algorithms  may  be  compared.  No  comparison  is  currently  possible,  as  no  other  method,  at 
the  time  of  this  writing,  has  been  evaluated  on  data  sets  of  this  size  and  complexity. 
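
The P, R and F1 columns of Table 5 are consistent with F1 being the usual harmonic mean of precision and recall. The short check below assumes that definition (the measure itself is defined elsewhere in the report) and uses values copied from the table; the function name `f1` is hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    return 2.0 * precision * recall / (precision + recall)

# (P, R, reported F1) copied from Table 5.
rows = {
    "Camera 1": (79.17, 44.97, 57.36),
    "Over all cameras": (72.25, 50.96, 59.77),
    "Multi-Camera, 3D descriptor": (48.90, 34.09, 40.17),
}
recomputed = {name: round(f1(p, r), 2) for name, (p, r, _) in rows.items()}
```

The recomputed values agree with the reported ones to two decimals, which is a useful sanity check when comparing a new method against these reference figures.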

Computation  Time  We  implemented  our  algorithms  in  MATLAB  and  we  used  the  Gurobi  Optimizer 
to  solve  the  Binary  Integer  Programs.  All  experiments  were  done  on  a  PC  with  Intel  i7-3610  2.3  GHz 
processor  and  6  GB  of  memory.  The  results  for  the  BIP-NoGroup  method  in  Table  [3]  were  produced  on  a 
Linux  machine  with  Intel  Xeon  E5540  2.53  GHz  processor  and  96  GB  memory  in  order  to  solve  very  large 
Binary  Integer  Programs. 

Our  approach  takes  on  average  45  minutes  per  camera  to  compute  trajectories  for  an  80-minute  video, 
without  counting  person  detection.  It  takes  an  additional  30  minutes  to  aggregate  trajectories  into  identities. 
Since  trajectories  can  be  computed  in  parallel  for  different  cameras,  and  person  detectors  can  be  run  in  real 
time  on  a  dedicated  GPU  for  each  camera,  our  method  achieves  real-time  performance.  However,  we  have 
not  built  an  end-to-end  multi-camera  system  because  we  did  not  have  the  necessary  computation  hardware 
and  programming  manpower. 

2.5  Conclusion 

The identity management system developed under this grant is the first end-to-end system that has been shown to be able
to  track  thousands  of  people  from  multiple  cameras.  Our  method  uses  the  same  optimization  framework  at 
all  levels  and  relies  on  a  novel  measure  of  trajectory  similarity  that  aggregates  information  over  time  with 
reduced  blurring.  We  created  a  new,  large,  fully  annotated  data  set  that  we  are  about  to  make  available  to  the 
research  community,  and  we  developed  a  new  measure  for  performance  evaluation  that  treats  within-camera 
and  across-camera  errors  uniformly  and  matches  the  complexity  of  the  task.  The  plug-and-play  nature  of 
our  system  along  with  our  new  data  set  and  evaluation  methodology  make  past  and  future  work  easier  to 
evaluate  consistently. 

While  we  solve  each  binary  integer  program  optimally  at  each  stage  of  the  pipeline,  we  lose  performance 
guarantees  when  the  stages  are  combined.  Our  experiments  show  that  we  can  achieve  real-time  identity 
management  with  good  precision  and  recall  performance.  However,  solving  the  entire  problem  optimally  is 
an  NP-hard  problem  that  is  still  beyond  the  state  of  the  art.  We  hope  that  our  contributions  will  help  keep 
pushing  improvements  of  performance  in  this  important  area. 

A  Dense, Long-Term Visual Tracking


The  goal  of  long-range,  high-density  motion  estimation  in  video  analysis  is  to  compute  the  life  of  every 
point  in  a  dense  sampling  of  the  visible  surfaces  in  the  scene.  The  image  projection  of  a  scene  point  moves 
along  a  path  in  the  image  plane.  Sometimes  the  point  is  visible,  and  sometimes  it  is  occluded  by  some  object 
in  the  world  or  by  the  boundaries  of  the  image.  In  a  dense  motion  estimate,  at  least  one  path  passes  through 
every  pixel  of  the  sequence. 

Dense, long-range motion estimation supports a number of applications. The computed paths can
propagate annotations or edits made in a single frame to multiple frames, thereby easing video labeling and
editing. If visible paths can be extrapolated into regions where they are occluded, the occluding object can
be removed from the video by painting the pixels it occupies with the extrapolated colors. Videos can be
segmented into separate objects by clustering paths into coherent groups. The shapes and appearance of the
resulting tube-like regions can support the detection and recognition of objects and activities.

Image motion information is either poor or altogether unavailable where the scene has little or no
visual texture (the so-called aperture problem). As a consequence, regularization (or priors, in probabilistic
parlance) must be employed to extrapolate motion information from textured to poorly textured regions. To
this end, we assume that (i) image paths live in a low-dimensional space, (ii) appearance remains
approximately constant along the visible portion of a path, and (iii) exactly one world point is visible at every
image point. The first assumption is exactly satisfied with rigid motion, and approximately satisfied in many
circumstances. The second assumption is pervasive in motion analysis, and the third excludes semi-transparent
objects.

Model 

Let $p$ be an index into a set of paths $\mathbf{x}_p(t) : T \to \mathbb{R}^2$, where $T$ is the (discrete) time
domain of the video sequence. A path is visible at time $t$ iff its visibility flag $\nu_p(t) : T \to \{0, 1\}$ is
equal to 1 at time $t$. Both functions $\mathbf{x}_p(t)$ and $\nu_p(t)$ are unknowns to be estimated for all paths
in a given video sequence. To ensure approximately (at first, and exactly later) at least one path per pixel in
every frame, we anchor $\mathbf{x}_p(t)$ to point $\mathbf{u}_p$ in some frame $\tau_p$ by letting
$\mathbf{x}_p(\tau_p) = \mathbf{u}_p$, and require enough anchor points to have some path pass
through every pixel in the video sequence. In contrast with LME, $\tau_p$ is path-specific and unrestricted.

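Paths are assumed to be in the space spanned by a sequence-specific basis of paths
$\{\boldsymbol{\varphi}_1, \ldots, \boldsymbol{\varphi}_K\}$, up to a shift:

$$\mathbf{x}_p(t) = \mathbf{u}_p + \sum_{k=1}^{K} c_{pk} \left( \boldsymbol{\varphi}_k(t) - \boldsymbol{\varphi}_k(\tau_p) \right) . \qquad (26)$$

The motion relative to the anchor point $\mathbf{x}_p(\tau_p) = \mathbf{u}_p$ is determined by the unknown
coefficients $\mathbf{c}_p = (c_{p1}, \ldots, c_{pK})$.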

Since paths in a video with $F$ frames have $F$ points, the standard basis over $\mathbb{R}^{2F}$ can represent
any path exactly. However, for many sequences a much more compact ($K \ll 2F$) basis is adequate, and provides
powerful, sequence-specific regularization.
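
To make the path model concrete, the following Python sketch evaluates an anchored path from a basis and a coefficient vector. The array layout (basis of shape K x F x 2), the zero-based frame index, and the function name `path_from_basis` are assumptions made for illustration.

```python
import numpy as np

def path_from_basis(basis, coeffs, anchor_point, anchor_frame):
    """Anchor point plus the coefficient-weighted sum of shifted basis paths.

    basis:        (K, F, 2) array, one basis path per row
    coeffs:       (K,) motion coefficients of the path
    anchor_point: (2,) image position of the anchor
    anchor_frame: zero-based index of the anchor frame
    """
    # Shift every basis path so that it passes through the origin at the anchor frame.
    shifted = basis - basis[:, anchor_frame:anchor_frame + 1, :]
    return anchor_point + np.tensordot(coeffs, shifted, axes=1)  # shape (F, 2)

rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 10, 2))      # K = 3 synthetic basis paths over F = 10 frames
coeffs = np.array([0.5, -1.0, 2.0])
anchor = np.array([40.0, 25.0])
path = path_from_basis(basis, coeffs, anchor, anchor_frame=4)
```

Whatever the coefficients, the path passes through the anchor point at the anchor frame, so only K numbers per path remain to be estimated.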

The model in equation (26) is Lagrangian, in the sense that it models individual points as they move
through  a  video  sequence.  This  formulation  is  in  contrast  with  the  more  traditional  Eulerian  specification  of 
optical  flow,  in  which  partial  differential  equations  describe  the  flow  that  passes  through  fixed  locations  in 
the  image.  To  rephrase,  the  observer  is  fixed  in  space  in  the  Eulerian  formulation,  and  moves  with  the  flow 
in  the  Lagrangian  one. 

Given basis paths and anchor points, we find paths and visibility flags by alternating between computing
optimal paths given visibility and computing optimal visibility given paths. The next two sections define the
optimality criteria for these computations. The sections thereafter show how to find the path basis and initial
anchors, and how to compute optimal paths, visibility, and anchors.

Optimal  paths  Given  a  set  of  basis  paths  and  a  set  of  anchors,  we  find  the  best  motion  coefficients  for 
each  path  by  minimizing  an  objective  function  that  penalizes  changes  in  appearance  along  a  path  (temporal 
smoothness)  and  differences  between  nearby  paths  (spatial  smoothness): 

$$\sum_{p \in \mathcal{P}} \sum_{t=1}^{F} E_D(\mathbf{c}_p, t) + \sum_{p, q \in \mathcal{P}} E_S(\mathbf{c}_p, \mathbf{c}_q) . \qquad (27)$$


The first term,

$$E_D(\mathbf{c}_p, t) = \nu_p(t)\, \Psi\left( I(\mathbf{c}_p, t) - I(\mathbf{c}_p, \tau_p) \right) , \qquad (28)$$

employs a robust penalty function $\Psi(s) = \sqrt{s^2 + \epsilon^2}$ to measure the difference between the image
intensity $I(\mathbf{c}_p, t) = I(\mathbf{x}_p(t))$ of the path in frame $t$ and that at the anchor $\mathbf{u}_p$ in
frame $\tau_p$. Multiplication by $\nu_p(t)$ ensures that this penalty is levied only on visible points. The second term,
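
A small Python sketch of the data term summed along one path follows. It assumes that the image intensities sampled along the path are already available as a one-dimensional array; the value of epsilon and the function names are illustrative only.

```python
import numpy as np

def robust_penalty(s, eps=1e-3):
    """Smooth approximation of |s|: sqrt(s^2 + eps^2)."""
    return np.sqrt(s ** 2 + eps ** 2)

def data_energy(intensity, anchor_frame, visible, eps=1e-3):
    """Sum over frames of visibility times the penalized intensity change from the anchor.

    intensity: (F,) intensities sampled along the path
    visible:   (F,) visibility flags in {0, 1}
    """
    diff = intensity - intensity[anchor_frame]
    return float(np.sum(visible * robust_penalty(diff, eps)))

intensity = np.array([0.5, 0.5, 0.9, 0.5])   # frame 2 is covered by a brighter occluder
occlusion_aware = data_energy(intensity, 0, np.array([1, 1, 0, 1]))
naive = data_energy(intensity, 0, np.ones(4))
```

With the occluded frame flagged as invisible, the energy stays near zero; treating it as visible adds a penalty of about 0.4, which would otherwise pull the path away from its true location.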


$$E_S(\mathbf{c}_p, \mathbf{c}_q) = \alpha_{pq} \sum_{k=1}^{K} \Psi(c_{pk} - c_{qk}) , \qquad (29)$$

measures the difference between the motion coefficients of pairs of paths. The multiplier $\alpha_{pq}$ couples
nearby paths that have similar appearance, and is equal to


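$$\alpha_{pq} = \exp\left( - \frac{\left( I(\mathbf{c}_p, \tau_p) - I(\mathbf{c}_q, \tau_q) \right)^2}{\ldots} \right) \qquad (30)$$

if the path $p$ is visible in the anchor frame of path $q$ (that is, if $\nu_p(\tau_q) = 1$) and passes close enough
to the anchor of $q$ (that is, if $\|\mathbf{x}_p(\tau_q) - \mathbf{u}_q\| < \Delta$). Otherwise, $\alpha_{pq} = 0$.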


Optimal visibility  The binary visibility flag $\nu_p(t)$ for each path and frame is modeled as an MRF whose
structure depends on the current estimates $\mathbf{x}_p(t)$ of the paths $p \in \mathcal{P}$. The MRF has one node
for each point $\mathbf{v}_p(t) = (\mathbf{x}_p(t), t)$ along some path, for $t = 1, \ldots, F$, and one binary random
variable $\nu_p(t)$ per node. The neighborhood of $\mathbf{v}_p(t)$ is the set of points $\mathbf{v}_q(t)$ with
$q \neq p$ and $\|\mathbf{v}_p(t) - \mathbf{v}_q(t)\| < \Delta$ for some small fixed $\Delta$ (spatial neighborhood),
plus the two points $\mathbf{v}_p(t - 1)$ and $\mathbf{v}_p(t + 1)$ that are temporally adjacent to $\mathbf{v}_p(t)$
along path $p$ (temporal neighborhood).

Each node in the MRF is associated with a binary observed visibility flag $\hat{\nu}_p(t)$ computed from the data
as follows. Path points in each frame are scored by their consistency, which measures how little a patch
around $\mathbf{v}_p(t)$ changes as it is transported by the current estimates of paths near $\mathbf{v}_p(t)$ to
(i) a few frames before and after time $t$, and (ii) the anchor frame $\tau_p$ for path $p$, similar to LME. The
controlling path at $\mathbf{v}_p(t)$ is the most consistent path through the spatial neighborhood of
$\mathbf{v}_p(t)$. Let now

$$d_{pq} = \frac{1}{F} \sum_{t=1}^{F} \|\mathbf{x}_p(t) - \mathbf{x}_q(t)\| \qquad (31)$$

Figure 5: A spatiotemporal cube of the marple7 sequence. Time runs from left to right. The corner of the
crate (cyan) is first occluded by Miss Marple’s arm (green) in frame 12. A small patch (red dashed squares)
around each path in every frame is transported along the current path estimates and monitored for consistent
appearance. The arm patch (top right) is most consistent, and makes this the controlling path at that point
and frame. Points along paths that either coincide with or are substantially parallel to a nearby controlling
path have their observed visibility flag $\hat{\nu}_p(t)$ set to 1. All other flags are set to 0. Observed flags affect
the estimated visibility flags at the nodes of an MRF that enforces spatial and temporal consistency of the flags
and ensures that at least one path is visible at every pixel.


be the average distance between two paths, and let $p^*$ be the controlling path at $\mathbf{v}_p(t)$. Then, the
observed visibility $\hat{\nu}_p(t)$ is defined as follows (see also Figure 5):

$$\hat{\nu}_p(t) = \begin{cases} 1 & \text{if } d_{pp^*} < 4 \text{ pixels} \\ 0 & \text{otherwise} \end{cases} \qquad (32)$$

In words, a path $p$ is observed to be visible at $\mathbf{v}_p(t)$ when it either coincides with ($p = p^*$ so that
$d_{pp^*} = 0$) or is nearly parallel ($d_{pp^*} < 4$) to the controlling path $p^*$ at $\mathbf{v}_p(t)$.
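
The test behind the observed visibility flag can be sketched in a few lines of Python. The sketch assumes that the controlling path has already been selected by the consistency score and that paths are stored as arrays with one row per frame; the function names are hypothetical.

```python
import numpy as np

def mean_path_distance(xp, xq):
    """Average Euclidean distance between two paths given as (F, 2) arrays."""
    return float(np.linalg.norm(xp - xq, axis=1).mean())

def observed_visibility(xp, x_controlling, threshold=4.0):
    """1 if the path coincides with or runs nearly parallel to the controlling path."""
    return int(mean_path_distance(xp, x_controlling) < threshold)

frames = np.arange(8.0)
controlling = np.stack([frames, 2.0 * frames], axis=1)
near = controlling + np.array([3.0, 0.0])   # parallel copy, 3 pixels away
far = controlling + np.array([3.0, 4.0])    # parallel copy, 5 pixels away
flags = [observed_visibility(p, controlling) for p in (controlling, near, far)]
```

The controlling path itself and the copy 3 pixels away are flagged as observed visible, while the copy 5 pixels away is not.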

The observed visibility flags $\hat{\nu}_p(t)$ influence the (hidden) visibility flags $\nu_p(t)$ through a data term
in the MRF. Let

$$\Delta I_p(t) = \Psi\left( I(\mathbf{c}_p, t) - I(\mathbf{c}_p, \tau_p) \right) \qquad (33)$$

be the same per-path, per-frame measure of intensity consistency used in (28). We define the following
average measure of intensity change along the visible portion of path $p$:

$$\Delta_p = \frac{\sum_{t=1}^{F} \nu_p(t)\, \Delta I_p(t)}{\sum_{t=1}^{F} \nu_p(t)} . \qquad (34)$$

For correctly estimated paths, this measure reflects variations of intensity caused by unmodeled effects such
as image noise or global illumination changes, rather than by occlusions. Given these definitions, the data
term of the MRF is defined as follows:

$$D(\nu_p(t) = 1) = \Delta I_p(t) + \lambda_L \left( 1 - \hat{\nu}_p(t) \right)$$
$$D(\nu_p(t) = 0) = \Delta_p + \lambda_L\, \hat{\nu}_p(t) . \qquad (35)$$

The terms with multiplier $\lambda_L$ bias estimated visibility values $\nu_p(t)$ toward observed values
$\hat{\nu}_p(t)$. Setting a point to be visible incurs the additional charge $\Delta I_p(t)$, equal to the change in
intensity between anchor and current point. Setting a point to be invisible incurs the additional charge $\Delta_p$
that accounts for the fact that intensity variations may be caused by effects other than occlusions.
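
The two unary costs can be tabulated per frame as in the following Python sketch. It assumes the reading of the data term given in the text: the visible cost is the intensity change plus a penalty when the observation says hidden, and the hidden cost is the path's average visible intensity change plus a penalty when the observation says visible. The weight value and the function name are illustrative, and the pairwise terms of the MRF are ignored here.

```python
import numpy as np

def unary_costs(delta_i, visible, observed, lam_l=1.0):
    """Per-frame costs of labeling a path point visible or hidden.

    delta_i:  (F,) robust intensity change between each frame and the anchor
    visible:  (F,) current estimate of the hidden visibility flags
    observed: (F,) observed visibility flags
    """
    # Average intensity change over the frames currently believed visible.
    delta_bar = (visible * delta_i).sum() / max(visible.sum(), 1)
    cost_visible = delta_i + lam_l * (1 - observed)
    cost_hidden = delta_bar + lam_l * observed
    return cost_visible, cost_hidden

delta_i = np.array([0.1, 0.1, 5.0, 0.1])    # large change in frame 2
flags = np.array([1, 1, 0, 1])
cost_visible, cost_hidden = unary_costs(delta_i, flags, observed=flags)
labels = (cost_visible < cost_hidden).astype(int)
```

Taken alone, these costs already prefer hiding frame 2 and keeping the other frames visible; the pairwise terms described next then smooth the labels in space and time.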

The weights on edges between the random variables of the MRF encourage both temporal and spatial
consistency among visibility values. Specifically, a penalty

$$V(\nu_p(t), \nu_p(t + 1)) = \lambda_T\, |\nu_p(t) - \nu_p(t + 1)| \qquad (36)$$

is added between temporally adjacent neighbors to discourage changes of visibility along a path. The weight
on an edge between spatial neighbors is

$$V(\nu_p(t), \nu_q(t)) = \lambda_S\, w_{pq}(t)\, |\nu_p(t) - \nu_q(t)| \qquad (37)$$


with

$$w_{pq}(t) = \frac{e^{-\left( \Delta I_{pq}(t) + \Delta I_{pq} \right)}}{d_{pq} + \epsilon}$$

where $\epsilon > 0$ prevents division by zero. In this expression,

$$\Delta I_{pq}(t) = \Psi\left( I(\mathbf{c}_p, t) - I(\mathbf{c}_q, t) \right) \qquad (38)$$
$$\Delta I_{pq} = \Psi\left( I(\mathbf{c}_p, \tau_p) - I(\mathbf{c}_q, \tau_q) \right) . \qquad (39)$$

In words, $\Delta I_{pq}(t)$ measures difference in appearance between paths in a single frame, and $\Delta I_{pq}$
measures a similar difference between anchor points. The combined effect of these two terms is to push
discontinuities in visibility closer to intensity boundaries, and the division by $d_{pq}$ reduces the spatial
discontinuity penalty between unrelated paths.

Finally, we clamp enough visibility values to 1 to ensure that every pixel in the sequence has a visible
path through it. Specifically, we make all anchor points visible, $\nu_p(\tau_p) = 1$, and we also force
$\nu_p(t) = 1$ if $d_{pp^*} < \sqrt{2}$. This assignment guarantees that at least one visible path goes through
every pixel because $d_{p^*p^*} = 0$. We roll the pairwise cost for each edge incident to a clamped node into the
unary cost for the other node of that edge.


Computation  Preliminaries 

Before  we  solve  for  motion  and  visibility,  we  select  basis  paths  and  an  initial  set  of  anchors,  paths,  and 
visibility  flags  as  follows. 


Finding the basis paths  Basis paths are obtained by first tracking a sparse set of feature points with a
frame-to-frame tracker [57]. This yields several tracks, that is, paths that do not necessarily extend through
the entire sequence. These tracks are supplemented with those formed by concatenating optical flow vectors
between consecutive frames ED, as described in more detail under Initialization below, where we do the
same to initialize a dense set of paths.

For  some  sequences,  several  tracks  may  extend  from  first  to  last  frame.  PCA  can  then  yield  a  basis 
whose  size  K  is  determined  by  adding  principal  components  until  the  reconstruction  residual  for  the  input 
tracks is below, e.g., 2 pixels.
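
For the case of full-length tracks, the choice of the basis size K can be sketched as follows in Python. Several details are assumptions made for illustration: each track is centered on its own mean position (since the model represents paths only up to a shift), the basis is taken from a truncated SVD of the centered track matrix rather than from a full PCA with mean subtraction across tracks, and the residual is measured as the root-mean-square point distance in pixels. The function name `choose_basis` is hypothetical.

```python
import numpy as np

def choose_basis(tracks, tol=2.0):
    """tracks: (N, F, 2) full-length tracks. Returns (basis of shape (K, F, 2), K)."""
    n, f, _ = tracks.shape
    # Assumption: remove each track's mean position before factorization.
    centered = tracks - tracks.mean(axis=1, keepdims=True)
    m = centered.reshape(n, 2 * f)
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    for k in range(1, len(s) + 1):
        recon = (u[:, :k] * s[:k]) @ vt[:k]
        # Assumption: residual = RMS point-to-point distance in pixels.
        rms = np.sqrt(((m - recon).reshape(n, f, 2) ** 2).sum(axis=2).mean())
        if rms < tol:
            break
    return vt[:k].reshape(k, f, 2), k

# Synthetic tracks that mix one horizontal and one vertical motion component.
t = np.arange(20.0)
horizontal = np.stack([t, np.zeros_like(t)], axis=1)
vertical = np.stack([np.zeros_like(t), t], axis=1)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
tracks = a[:, None, None] * horizontal + b[:, None, None] * vertical + 100.0
basis, K = choose_basis(tracks)
```

On these synthetic tracks, one component leaves a residual of several pixels and two components reproduce the tracks exactly, so the sketch stops at K = 2.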

In general, however, occlusions and tracking failures make tracks start late and end early, leading to
a matrix of track coordinates with missing entries. We iterate between matrix factorization with missing
data [24] and a compaction step that associates tracks corresponding to the same world point [64]. If needed,
a user can be asked to correct mistakes in data association. We scale path coordinates so that the mean
per-path motion between frames is one pixel.

Initialization To cover every pixel in a video sequence with paths, we need to create a number of paths on
the same order as the number of visible points in the sequence. To this end, we form an initial set of paths by
placing anchor points at every pixel in the first and last frames in the sequence, and supplement these with
additional anchors in regions that are not yet covered by some path. More specifically, we start by defining
path fragments we call temporal superpixels with the procedure described by Sundaram et al. [71]. We first
concatenate optical flow vectors into tracks, which we break when the optical flow field fails a forward-backward
consistency check or when the point is too close to a motion boundary (equations (5) and (6) from [71],
respectively). To prevent merging foreground and background tracks, we create a thin empty buffer around
the  regions  where  tracks  terminate.  If  a  superpixel  thus  created  extends  over  several  frames,  we  replace  it 
by  an  entire  path  whose  coefficients  are  computed  by  projecting  the  superpixel  onto  the  path  basis.  If  the 
superpixel  is  too  short,  we  copy  the  coefficients  of  the  path  with  the  greatest  intensity  consistency  over  a 
few  frames  among  existing,  nearby,  parallel  paths.  The  temporal  extent  of  superpixels  provides  an  initial 
estimate  for  the  visibility  flags. 
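
The track-building step can be sketched as below. The sketch assumes the forward-backward test has the commonly used form |w + w'|^2 < 0.01 (|w|^2 + |w'|^2) + 0.5, where w is the forward flow at a point and w' the backward flow at the displaced point; the two constants are exposed as parameters because they are an assumption here and should be checked against the cited equations. The flow lookups are hypothetical callables, and the motion-boundary test and the empty buffer are omitted.

```python
def forward_backward_consistent(w_fwd, w_bwd, rel_tol=0.01, abs_tol=0.5):
    """w_fwd: forward flow (dx, dy) at a point in frame t.
    w_bwd: backward flow sampled at the displaced point in frame t + 1.
    For a reliable match the two vectors nearly cancel."""
    sx, sy = w_fwd[0] + w_bwd[0], w_fwd[1] + w_bwd[1]
    mismatch = sx * sx + sy * sy
    scale = (w_fwd[0] ** 2 + w_fwd[1] ** 2 +
             w_bwd[0] ** 2 + w_bwd[1] ** 2)
    return mismatch < rel_tol * scale + abs_tol

def build_track(start, fwd_flows, bwd_flows):
    """Concatenate flow vectors from `start`, breaking the track at the
    first failed consistency check. fwd_flows[t](x, y) and bwd_flows[t](x, y)
    are hypothetical flow lookups for the frame pair (t, t + 1)."""
    track = [start]
    for fwd, bwd in zip(fwd_flows, bwd_flows):
        x, y = track[-1]
        w = fwd(x, y)
        nx, ny = x + w[0], y + w[1]
        if not forward_backward_consistent(w, bwd(nx, ny)):
            break
        track.append((nx, ny))
    return track
```

The number of frames a track survives is what later serves as the initial visibility estimate for the corresponding superpixel.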

Every  temporal  superpixel  that  extends  to  the  first  or  the  last  frame  yields  an  initial  path  anchored  in 
that  frame.  The  remaining  superpixels  are  said  to  be  covered  if  their  paths  differ  by  an  average  of  less  than 
two  pixels  per  frame  from  an  existing  path.  New  paths  are  formed  only  for  superpixels  that  are  not  yet 
covered, and their anchor points are placed in the last frame of the superpixel. Figure 6a shows the anchor
points  selected  in  this  way  for  the  marple7  sequence.  Colors  other  than  gray  are  anchors,  and  similar  colors 
correspond  to  similar  sets  of  path  coefficients. 
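
The coverage test can be sketched as follows. The data layout (a path as a dictionary from frame index to position) and the function names are assumptions made for illustration, and the average is taken over the frames that the superpixel and the existing path share.

```python
import math

def mean_distance(path_a, path_b):
    """Average Euclidean distance over the frames two paths share.
    Each path is a dict {frame: (x, y)}."""
    shared = set(path_a) & set(path_b)
    if not shared:
        return math.inf
    return sum(math.dist(path_a[t], path_b[t]) for t in shared) / len(shared)

def uncovered(superpixel_paths, existing_paths, tol=2.0):
    """Superpixel paths that no existing path approximates to within
    `tol` pixels per frame on average; only these get new anchors."""
    return [s for s in superpixel_paths
            if all(mean_distance(s, e) >= tol for e in existing_paths)]
```

A superpixel that runs one pixel away from an existing path is covered and produces no new path, while one that runs five pixels away is returned as uncovered.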

After this initialization stage, the energy functions defined earlier are minimized by the algorithms
described in the previous Section. This can result in the insertion of additional anchor points. Figure 6b shows
the color-coded anchor points after convergence.

Optimization 

Starting  with  the  paths  and  visibility  flags  constructed  as  described  above,  we  interleave  two  steps  during 
optimization: a combinatorial optimization step finds visibility flags $\nu_p(t)$ for the current path estimates,
and a continuous optimization step updates path coefficients $c_p$ given the current visibility estimates. In the
process,  we  add  anchor  points  until  every  pixel  in  the  sequence  has  at  least  one  path  through  it,  and  remove 
anchors  of  invisible  paths.  We  stop  when  the  maximum  change  in  every  path  falls  below  one  pixel  in  every 
frame. 

The initial path estimates are often poor along occlusion boundaries, because visibility is not yet
accounted for. Because of this, we heuristically regroup paths between each combinatorial and continuous
step to let foreground and background vie for paths between them.

We  now  describe  the  continuous  step,  path  regrouping,  combinatorial  step,  anchor  management,  and 
termination. 

Figure 6: Anchor points selected during initialization (a) and at convergence (b). Colors other than gray
denote anchor points, and similar colors denote similar sets of path coefficients. Note the improved
segmentation of Miss Marple after convergence.

Continuous step. We update path coefficients by minimizing the energy function (27) via trust-region
Newton Conjugate Gradients optimization [62]. This method only requires computing vectors of the form
$Hv$, where $H$ is the Hessian, rather than the very large but sparse $H$ itself. The sparsity pattern of $H$ changes
over time because the coupling coefficients $a_{pq}$ in equation (30) depend in turn on the path coefficients.
When computing successive conjugate gradients, we treat the terms $a_{pq}$ as constants, which is a good
approximation for small path perturbations, and recompute them between full descent steps.
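
The reason Hessian-vector products suffice can be illustrated with plain linear conjugate gradients applied to the Newton system H d = -g. This is a simplified sketch: a trust-region Newton-CG method additionally bounds the step length and handles negative curvature, both omitted here, and the tridiagonal toy Hessian is a stand-in chosen for illustration rather than the Hessian of the energy above.

```python
def conjugate_gradient(hess_vec, g, iters=50, tol=1e-10):
    """Approximately solve H d = -g using only products hess_vec(v) = H v.
    Assumes H is symmetric positive definite."""
    d = [0.0] * len(g)
    r = [-gi for gi in g]              # residual of H d = -g at d = 0
    p = list(r)                        # first search direction
    rr = sum(x * x for x in r)
    for _ in range(iters):
        if rr < tol:
            break
        hp = hess_vec(p)
        alpha = rr / sum(pi * hi for pi, hi in zip(p, hp))
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rr_new = sum(x * x for x in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return d

def tridiagonal_hess_vec(v):
    """H v for a toy Hessian with 2 on the diagonal and -1 beside it.
    The matrix itself is never stored."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]
```

Calling `conjugate_gradient(tridiagonal_hess_vec, g)` returns a step d for which `tridiagonal_hess_vec(d)` matches `-g` up to the tolerance, using memory proportional to the number of unknowns.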

Path regrouping. After 40 descent steps, we allow paths to copy their coefficients and visibility flags from
one of their neighbors if doing so improves the path's fit to data. Specifically, path $p$ copies from path $q$ if
$\nu_q(\tau_p) = 1$, $\tau_p \neq \tau_q$, $d_{pq}(\tau_p) < A$, path $q$ is visible for at least half the frames, and the copy improves the
data fit for $p$ the most. Figure 7 illustrates the benefits of this step on the marple7 sequence.
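
The selection rule in the regrouping step has the shape of a filtered arg-min, sketched below. The eligibility tests and the data cost are passed in as hypothetical callbacks because their exact definitions depend on quantities defined elsewhere in the report; only the control flow is illustrated.

```python
def pick_donor(p, neighbors, eligible, cost_if_copied, current_cost):
    """Return the neighbor whose coefficients would give path p the lowest
    data cost, or None if no eligible neighbor beats current_cost.
    eligible(p, q) and cost_if_copied(p, q) are hypothetical callbacks."""
    best, best_cost = None, current_cost
    for q in neighbors:
        if not eligible(p, q):
            continue                    # q fails a visibility or distance test
        cost = cost_if_copied(p, q)
        if cost < best_cost:            # strict improvement required
            best, best_cost = q, cost
    return best
```

Requiring a strict improvement means a path keeps its own coefficients whenever no neighbor fits the data better, so the step cannot increase the data cost of any individual path.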

Combinatorial step. Visibility flags are updated after path regrouping by using graph cuts [20, 52] to
compute the MAP estimate for the MRF defined earlier. The energy function is amenable to this method as


(a) Without regrouping. (b) With regrouping.

Figure 7: Effect of path regrouping. Motion estimates are shown using the same color scheme as in Figure 6.
The first image in each pair shows the solution after the first round of optimization; the second shows results
at convergence. Regrouping (b) recovers from a poor local optimum with incorrect estimates for the motion
of the occluded background.
Sequence   Method      APIE   Path length (frames)  Path density (pixels)  % pixels containing
                              Mean    Std. dev.     50th   95th   99th     visible paths
Flowerbed  LDOF traj.  13.97  11.2    10.5          0.47   8.5    15.2     66.6%
           LME         9.37   23.9    7.3           0.31   0.79   1.3      97.5%
           Ours        6.10   23.3    7.2           0.29   0.66   0.84     99.8%
Truck      LDOF traj.  19.74  6.8     7.5           1.2    47.0   70.0     47.3%
           LME         17.82  23.4    7.4           0.39   1.9    4.4      88.6%
           Ours        9.80   22.0    9.0           0.27   0.65   0.86     99.8%
Marple7    LDOF traj.  7.84   6.7     6.4           0.47   6.9    13.3     69.0%
           LME         6.28   15.9    7.5           0.43   5.7    9.7      76.0%
           Ours        5.61   15.5    6.8           0.32   0.70   0.89     99.7%
Marple1    LDOF traj.  15.83  9.4     11.4          0.51   14.7   26.7     62.5%
           Ours        8.79   18.9    19.3          0.35   0.85   1.0      97.8%
Marple8    LDOF traj.  13.69  14.9    14.7          0.47   7.4    16.4     71.2%
           Ours        9.30   45.8    20.8          0.25   0.65   0.87     99.7%

Table 6: Solution quality metrics. APIE measures average intensity constancy along estimated paths (smaller
is better, assuming the brightness constancy assumption holds). Path length is the number of frames in which
a path is reported as visible (longer is typically better). Path density is computed by measuring the distance
to the nearest visible path for each pixel. We report the 50th, 95th, and 99th percentiles (smaller is better), as
well as the percentage of pixels with a visible path within a radius of 1 pixel (larger is better). The marple1
and marple8 sequences were too large for LME to complete within a reasonable timeframe. We stopped
computation after 72 hours when only a single iteration had completed.


the edge costs (37) satisfy

$$V(0,0) + V(1,1) \leq V(0,1) + V(1,0),$$
$$0 \leq V(0,1) + V(1,0).$$
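
The condition under which a binary pairwise energy can be minimized exactly by graph cuts is easy to check edge by edge. The sketch below uses a hypothetical representation in which each edge cost is a dictionary keyed by label pairs; the example costs are made up for illustration and are not the edge costs of the MRF above.

```python
def is_submodular(V):
    """V maps label pairs (a, b) with a, b in {0, 1} to the cost of one edge.
    True if agreeing labels cost no more in total than disagreeing ones."""
    return V[0, 0] + V[1, 1] <= V[0, 1] + V[1, 0]

def all_edges_submodular(edge_costs):
    """edge_costs: iterable of per-edge cost tables as accepted above."""
    return all(is_submodular(V) for V in edge_costs)

# A smoothness cost that penalizes neighbors with different visibility flags.
smooth = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 0.5, (1, 0): 0.5}
# A cost that rewards disagreement violates the condition.
repulsive = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.0, (1, 0): 0.0}
```

Here `is_submodular(smooth)` holds and `is_submodular(repulsive)` does not, so an energy containing the second kind of edge would need a different solver or an approximation.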

Anchor  management.  When  the  maximum  change  in  any  path  in  any  frame  is  less  than  one  pixel,  we  check 
that  every  pixel  in  the  video  has  a  visible  path  through  it.  If  not,  we  add  new  anchor  points  to  fill  voids  and 
resume  optimization.  Newly  inserted  paths  copy  their  initial  parameters  from  the  closest  visible  path. 

Anchors  on  paths  that  are  invisible  everywhere  except  at  the  anchor  itself  (which  is  always  visible) 
are  deleted.  These  one-point  paths  occur  when  visibility  estimation  correctly  identifies  an  outlier  with  an 
incorrect  path  estimate. 

Termination.  Optimization  terminates  when  all  path  estimates  change  by  less  than  a  pixel  in  every  frame 
and  all  pixels  in  the  video  have  a  path  through  them. 
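
The stopping rule combines the two conditions above and can be sketched as follows. The data layout is an assumption: each path is a list of positions, one per frame, and `covered` holds one boolean per pixel of the video.

```python
import math

def max_path_change(old_paths, new_paths):
    """Largest displacement of any path in any frame between two iterations.
    Each path is a list of (x, y) positions, one per frame."""
    return max(math.dist(a, b)
               for old, new in zip(old_paths, new_paths)
               for a, b in zip(old, new))

def should_stop(old_paths, new_paths, covered, tol=1.0):
    """covered: one boolean per pixel in the video, True if some visible
    path passes through that pixel."""
    return max_path_change(old_paths, new_paths) < tol and all(covered)
```

If the paths have settled but some pixel is still uncovered, `should_stop` returns False, which corresponds to the case where new anchors are inserted and optimization resumes.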

Results 

We  evaluate  the  performance  of  our  technique  on  five  real  sequences  of  increasing  complexity,  all  with  large 
motions  and  significant  occlusions.  The  popular  flowerbed  (29  frames)  and  a  new  sequence  with  a  truck 
driving  behind  a  road  sign  (33  frames)  contain  only  rigid  motion.  The  three  with  non-rigid  motion  are  from 
the Berkeley motion segmentation dataset [23]: 60 frames from marple1, 72 frames from marple8, and
25 frames from marple7. The marple7 and flowerbed sequences are the same as those evaluated in LME.
Figure 8 shows sample frames.

(e) Marple8. Eight basis functions; 68 hours (72 frames).


Figure  8:  Results  of  our  method.  For  each  sequence,  we  show  the  first  and  last  frames,  followed  by  the  last 
frame  warped  to  align  with  the  first  frame,  and  vice  versa.  Regions  detected  as  occluded  in  the  source  frame 
of the warp are marked in white. Solution times (rounded to nearest half-hour) exclude basis computation.

Qualitative evaluation For a qualitative evaluation, we use our motion results to warp all frames to a
selected frame. This creates a motion-compensated video that should appear static except for regions that
are occluded in a particular frame, which we paint white. Figure 8 shows the last frame aligned to the first
frame, and vice versa, for all sequences.

Quantitative  evaluation  It  is  difficult  to  get  reliable  ground  truth  paths  for  realistic  sequences.  Synthetic 
datasets [25] do not preserve associations across occlusions. Manual labeling for real sequences is
painstaking and unreliable, particularly for complex motions or low-texture regions.

Instead, we measure the degree to which intensities remain constant along computed paths as a proxy
for performance. We define the all-path interpolation error (APIE)

$$\mathrm{APIE} = \frac{1}{N} \sum_{p=1}^{N} \frac{\sum_{t} \nu_p(t) \left( I(c_p, t) - I(c_p, \tau_p) \right)^2}{\sum_{t} \nu_p(t)} \qquad (41)$$

where $N$ is the number of paths. Even on perfect paths, APIE would measure the correctness of the
brightness constancy assumption, and would be nonzero in general.
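
A small sketch of how such an error could be computed is given below. It assumes that APIE is the mean over paths of the visibility-weighted mean squared difference between the intensity along the path and the intensity at the path's reference frame; the data layout, the function name, and the decision to skip paths with no visible frame are assumptions made for illustration.

```python
def apie(paths):
    """paths: list of dicts with
         'intensity': intensity sampled along the path, one value per frame
         'visible':   0/1 visibility flags, one per frame
         'ref':       index of the reference (anchor) frame
    Paths with no visible frame are skipped (an assumption)."""
    total, count = 0.0, 0
    for p in paths:
        n_visible = sum(p['visible'])
        if n_visible == 0:
            continue
        ref = p['intensity'][p['ref']]
        squared = sum(v * (i - ref) ** 2
                      for i, v in zip(p['intensity'], p['visible']))
        total += squared / n_visible     # per-path mean over visible frames
        count += 1
    return total / count if count else 0.0
```

Frames flagged as invisible contribute nothing, so a path that is correctly marked as occluded is not penalized for the intensity of the occluder.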

Table 6 reports the APIE for different methods for each sequence, computed with intensity values in
[0, 255]. LDOF trajectories do not directly report visibility; the correspondences for trajectories after
occlusions are simply missing. These entries are ignored as if they had $\nu_p(t) = 0$. We use the location and frame
of the first observation of each trajectory as its reference appearance. LME paths are anchored either in the
first or last frame and do report visibility values.

We aim to recover paths that maintain correspondence across occlusions. We measure our success by
analyzing the average length of a path, defined as $\frac{1}{N} \sum_p \sum_t \nu_p(t)$. As can be seen in Table 6, the two
methods that estimate visibility (our method and LME) return significantly longer paths on average as a
result of their ability to detect disocclusions. Further, the average length of our paths tends to correspond to
the length of the dominant occluder.

A  key  feature  of  our  algorithm  is  the  ability  to  compute  the  path  for  every  visible  point  in  a  scene.  We 
measure  path  density  by  computing  the  distance  to  the  closest  visible  path  for  each  pixel  in  the  sequence. 
Table 6 reports the 50th, 95th, and 99th percentiles for each method, as well as the total percentage of pixels
with  a  visible  path  within  a  distance  of  1  pixel.  LDOF  trajectories  leave  many  pixels  unexplained  because 
they  are  not  initialized  in  low-texture  areas.  LME  misses  objects  not  visible  in  either  the  first  or  last  frame 
of  a  sequence.  In  many  sequences,  these  missed  objects  can  account  for  a  significant  fraction  of  the  scene. 
Our  method  explains  over  97%  of  the  pixels  in  every  sequence. 
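
The density statistics can be sketched for a single frame as follows; the full measurement would pool the distances over all frames of a sequence. The brute-force nearest-neighbor search, the nearest-rank percentile definition, and the function names are choices made for this illustration only.

```python
import math

def nearest_path_distances(pixels, path_points):
    """Distance from each pixel to the closest visible path position."""
    return [min(math.dist(px, q) for q in path_points) for px in pixels]

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def density_summary(pixels, path_points, radius=1.0):
    """Returns ({50: d50, 95: d95, 99: d99}, percent of pixels that have
    a visible path within `radius` pixels)."""
    d = nearest_path_distances(pixels, path_points)
    covered = 100.0 * sum(x <= radius for x in d) / len(d)
    return {pct: percentile(d, pct) for pct in (50, 95, 99)}, covered
```

A method that leaves large regions without paths shows up in the upper percentiles long before it affects the median, which is why the 95th and 99th percentiles are reported alongside the 50th.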


Parameter  sensitivity  Our  technique  uses  a  few  parameters  that  could  be  tuned  if  desired.  We  selected 
settings  for  the  parameters  by  hand  considering  the  results  on  the  flowerbed  sequence  only,  and  used  the 
same  values  for  all  five  sequences.  We  set  A  =  1,  a  =  50,  and  \l  =  At  =  A^  =  0.5.  We  re-scale  intensity 
values  to  [0, 1]  for  the  combinatorial  optimization  step  to  match  the  range  of  the  binary  unknowns.  In  our 
experiments,  we  found  that  results  were  relatively  insensitive  to  small  changes  in  the  values  of  A  or  a,  but 
were  more  sensitive  to  the  values  of  the  parameters  for  the  occlusion  detection  step. 